ABSTRACT


This work proposes a density sensitive distance measurement that takes into
account the density of an underlying dataset to better represent the shape of the data when
measuring distance. Kernel density estimation, using kernel bandwidths determined by
k-nearest neighbor distances, is used to approximate the density of the underlying
dataset. A scale is applied to the resulting kernel density estimate and a line integral is
performed along its surface, resulting in a density sensitive distance. This work tests the
utility of the proposed density sensitive distance measurement using supervised learning.
k-Nearest Neighbor classification using the proposed density sensitive distance
measurement and k-Nearest Neighbor classification using Euclidean distance are
compared on the Wisconsin Diagnostic Breast Cancer dataset and the MNIST Database of
Handwritten Digits. For perspective, these classifiers are also compared to Support
Vector Machine and Random Forests classifiers. Stratified 10-fold cross validation is
used to determine the generalization error of each classifier. In all comparisons,
k-Nearest Neighbor classification using the proposed density sensitive distance
measurement had less generalization error than k-Nearest Neighbor classification using
Euclidean distance. For the MNIST dataset, k-Nearest Neighbor classification using the
density sensitive distance measurement also had less generalization error than both
Support Vector Machine and Random Forests classification.

I. INTRODUCTION


When Operations Specialists, Combat Information Center Watch Officers, or
Tactical Action Officers sit at their respective consoles, they often monitor sensor sweeps
for incoming and outgoing surface, sub-surface, and aerial traffic. With the help of these
watchstanders, a ship's combat system will interpret the sweeps and produce surface,
sub-surface, or aerial tracks. If the ship's combat system is too sensitive, then many false
tracks are produced. If the ship's combat system is too indifferent, then tracks that should
be  produced  are  not.  Either  way,  the  majority  of  a  watchstander's  time  can  be  spent 
analyzing  whether  or  not  a  track  in  the  combat  system  is  actually  there  and  cleaning  up 
tracks  that  are  not.  Moreover,  since  these  tracks  represent  friendly,  neutral,  or  hostile 
entities,  a  great  amount  of  care  is  taken  to  ensure  that  tracks  are  classified  correctly. 
Tracks  are  analyzed  not  only  for  their  existence,  but  also  for  their  operating 
characteristics  and  signatures.  Since  a  Combat  Information  Center  would  not  want  to  fire 
on  a  friendly  force,  a  commercial  airliner,  a  fishing  boat,  or  a  cargo  ship,  a  great  amount 
of  time  is  taken  to  make  sure  that  a  hostile  track  is  actually  a  hostile  track.  The  time 
taken  to  verify  that  the  system  is  correct  is  necessary  because  the  algorithms  in  use  are 
noisy.  If  an  anti-ship  cruise  missile,  a  low-slow  flyer,  or  an  explosives-filled  wooden 
fishing vessel were inbound, then the watchstanders in that Combat Information Center
may only have a few seconds from detection to reaction in order to avoid being hit.
There is not enough time to verify that something inbound is real and correctly classified.
Therefore,  new  or  revised  classification  algorithms  must  be  employed. 

The  analysis  of  the  sensor  sweeps  performed  by  the  combat  system  falls  into  the 
category  of  computer  vision  -  an  application  of  machine  learning.  The  algorithms  used 
by  the  combat  system  are  designed  to  classify  the  information  in  the  sweeps. 

The  combat  system  needs  to  know  how  to  distinguish  a  plane  from  the  sky  using  a 
three  dimensional  radar,  a  ship  from  the  sea  using  a  surface  radar,  and  a  submarine  from 
the  ocean's  floor  using  sonar  without  having  any  understanding  of  what  a  plane,  the  sky,  a 
ship, the sea, a submarine, or the ocean's floor is. To run efficiently (and hopefully
effectively), the combat system simply needs to know that a generic difference exists and
how to take advantage of that difference to classify these entities. This difference is often
expressed as a simple distance measurement. Since the combat system represents the
physical environment as data in some information space, the farther apart two data points
are in that space, the less likely the corresponding physical objects are related.

Once the combat system is given information (such as a processed radar feed),
the system can pass this information on to the classification algorithms. These algorithms
take  in  the  unknown  information  and  efficiently  attempt  to  determine  if  that  information 
represents  a  plane,  the  sky,  a  ship,  the  sea,  a  submarine,  or  the  ocean's  floor.  Since  the 
classification  algorithms  have  been  trained  to  recognize  planes,  there  is  a  good  chance 
that  information  representing  a  plane  will  end  up  closer  to  where  the  previous 
information  about  planes  has  ended  up.  Moreover,  a  simple  distance  metric  is  often  used 
to  determine  which  class  an  unknown  piece  of  information  belongs  to.  If  the  information 
is  closer  to  planes  than  it  is  to  submarines,  then  that  information  is  probably  a  plane. 

With this classification in hand, the combat system can cross-reference the newly
classified item in a variety of ways to determine if it is a friendly, a neutral, a hostile, or
simply part of the background.

The  first  step  in  any  of  these  classification  routines  is  to  receive  processed  sensor 
feeds,  vice  raw  sensor  feeds.  More  often  than  not,  raw  sensor  feeds  give  little  to  no 
relevant  information  that  a  classification  algorithm  would  need  to  do  its  job.  A  raw 
sensor  feed  usually  gives  nothing  more  than  a  direct  reading,  not  how  that  reading  differs 
or  works  in  conjunction  with  previous  readings.  This  is  where  processing  comes  into 
play.  A  raw  sensor  feed  can  be  manipulated  in  order  to  emphasize  invariant  aspects  of 
the  object  of  interest.  Like  processing  an  image  from  a  digital  camera,  noise  can  be 
eliminated  or  minimized,  changes  in  intensity  can  be  determined,  and  normalization  can 
be performed. This information can be included in the processed feed that
classification  routines  receive  in  order  to  increase  the  likelihood  of  a  correct 
classification. 

As previously noted, a distance metric is commonly used in classification
algorithms in order to determine how similar or different one thing is from another in the
information space. By far, the most common distance measurement utilized by these
algorithms is Euclidean distance. Euclidean distance assumes that two data points are
similar based on their proximity to each other, without regard to the other things around
them. In other words, if we needed to determine that something is a ship or the sea in the
information space, then Euclidean distance would simply determine which was closer,
without regard to the density of the ships or the sea. We could, however, take into
account the density of ships and the sea in order to achieve a density-sensitive distance
measurement. In other words, if the ships were the red dots, the seas were the blue dots,
and a black dot, equally far away from either the ships or the seas, was a new piece of
information that we just received (as in Figure 1), then Euclidean distance would
arbitrarily classify that new piece of information because the densities of the classes
being dealt with are not taken into account.

Figure 1. A piece of information (the black dot) equally far away from two classes with
different densities (the blue and red dots).

Since  the  seas  (the  blue  dots)  are  much  more  dispersed  than  the  ships  (the  red 
dots)  in  the  information  space  above,  then  it  can  be  argued  that  in  order  for  the  new  piece 
of  information  (the  black  dot)  to  be  considered  a  ship  (red),  it  ought  to  be  as  close  to  the 
rest  of  the  ships  as  all  the  ships  are  to  each  other.  In  other  words,  for  this  new  piece  of 
information  (the  black  dot)  to  be  classified  as  a  ship  (red),  it  ought  to  mimic  the  level  of 
dispersion of the previously encountered ships (the red dots). Moreover, since the
position of this new information (the black dot) in the information space is more
consistent  with  the  density  of  the  seas  (the  blue  dots),  then  it  makes  more  sense  to 
classify  this  new  information  as  the  sea  (blue). 

Therefore,  we  need  a  distance  that  is  sensitive  to  the  density  of  the  data  over 
which  it  will  measure.  Moreover,  this  density  sensitive  distance  should  be  able  to  take 
measurements  over  any  set  or  subset  of  data,  regardless  of  class. 

II. RELATED WORK


The  creation  of  this  density  sensitive  distance  measurement  will  be  based  on 
previous  work  regarding  kernel  density  estimates,  distance  measurements,  and  principal 
component  analysis. 

A.  KERNEL  DENSITY  ESTIMATES 

Although  there  are  many  different  kernel  density  estimates,  the  key  ones  that  this 
work  most  relied  on  are  Parzen  Windows  and  Manifold  Parzen  Windows. 


1.  Parzen  Windows  (Parzen,  1962) 


Parzen Windows is a method of estimating the probability density function $f(x)$
of a random variable $X$ from sample data generated by that random variable.

The Parzen Windows method centers a weighting function $K$ with a common width $h$
on top of each sample $x_i$ in the dataset and then adds up a scalar multiple of all those
functions to produce an estimate of the underlying probability density function $f(x)$.
Thus, the univariate Parzen Windows estimate of the probability density function is

$$\hat{f}_m(x) = \frac{1}{mh} \sum_{i=1}^{m} K\left(\frac{x - x_i}{h}\right)$$

where $m$ is the number of points in the sample dataset, $\hat{f}_m(x)$ is an estimate of the
probability density function $f(x)$ using $m$ samples, $h$ is the common width of the
weighting function $K$, $x_i$ is the $i$-th point in the sample dataset for $i = 1, \ldots, m$, and $K$
is the weighting function. For a variety of reasons (among them, so that the estimate
itself integrates to one), the weighting function $K$ must satisfy:

$$\int_{-\infty}^{\infty} K(x)\,dx = 1.$$

The weighting function $K$, also called the kernel function, can take many forms;
however, one of the more popular weighting functions is the following:

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right).$$

This weighting function satisfies $\int_{-\infty}^{\infty} K(x)\,dx = 1$ such that $\hat{f}_m(x)$ becomes

$$\hat{f}_m(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$

which is the normalized sum of univariate normal distribution probability density
functions where $x_i$ is the mean of the $i$-th function and $h$ is the common standard
deviation.
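
As a minimal sketch of the univariate estimate with the normal weighting function, the following Python evaluates $\hat{f}_m(x)$ directly from its definition. The function names, the sample values, and the width are illustrative assumptions and are not taken from the experiments in this work.

```python
import math

def gaussian_kernel(u):
    """Standard normal density; it integrates to 1 over the real line."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_estimate(x, samples, h):
    """Univariate Parzen Windows estimate with a common width h."""
    m = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (m * h)

# Illustrative data and width (assumptions, not from the thesis).
samples = [-1.0, 0.0, 0.5, 2.0]
h = 0.5

# Riemann-sum check over [-10, 10] that the estimate integrates to about 1.
step = 0.01
area = sum(parzen_estimate(-10.0 + i * step, samples, h) for i in range(2001)) * step
print(round(area, 4))
```

With a single sample at 0 and $h = 1$, the estimate at $x = 0$ reduces to the standard normal density, $1/\sqrt{2\pi}$.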


For multivariate Parzen Windows where widths are allowed to vary along each
dimension, the estimate of the probability density function becomes

$$\hat{f}_m(x) = \frac{1}{m} \sum_{i=1}^{m} \prod_{j=1}^{n} \frac{1}{h_j} K\left(\frac{x_j - x_j^{(i)}}{h_j}\right)$$

where $m$ is the number of samples in the dataset, $x$ is an $n$-dimensional variable with
components $x_j$ for $j = 1, \ldots, n$, $\hat{f}_m(x)$ is an estimate of the probability density function
$f(x)$ using $m$ samples of $n$ dimensions, $h_j$ is the width of the weighting function $K$
that is common along the $j$-th dimension, $x_j^{(i)}$ is the $j$-th component of the $i$-th sample
from the dataset for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, and $K$ is the weighting function
(Wasserman, 2007).

Moreover, if the widths are allowed to covary along all dimensions, then the
Parzen Window estimate is

$$\hat{f}_m(x) = \frac{1}{m h^n} \sum_{i=1}^{m} K\left(\frac{x - x_i}{h}\right)$$

(Alpaydin, 2004). For similar reasons as the univariate case, the multivariate weighting
function $K$ must satisfy the following:

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} K(x)\,dx_1 \cdots dx_n = 1.$$


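When $h = 1$, the multivariate weighting function $K$ can be the following:

$$K(x) = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left(-\frac{1}{2} x^T S^{-1} x\right)$$

where $n$ is the dimension of the variable $x$, $S$ is the $n \times n$ covariance matrix of the
sample data, $|S|$ is the determinant of that covariance matrix, and $S^{-1}$ is the inverse of
that covariance matrix. Note that here $x$ is assumed to be an $n \times 1$ column vector such
that the transpose of that column vector $x^T$ results in a $1 \times n$ row vector. This
multivariate weighting function satisfies
$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} K(x)\,dx_1 \cdots dx_n = 1$ such that $\hat{f}_m(x)$ becomes

$$\hat{f}_m(x) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left(-\frac{1}{2} (x - x_i)^T S^{-1} (x - x_i)\right).$$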

Figure 2. An example of Parzen Windows.


Figure 2 is an example of Parzen Windows (Vincent & Bengio, 2002). On the
left, we have a sample dataset. On the right, we have an estimate of the
underlying probability density function using the multivariate weighting function $K$
with diagonal covariance matrix $S$ with equal variance in all dimensions (i.e., a
"spherical" bivariate normal).

For  Parzen  Windows,  one  should  recognize  that  the  same  width  h  is  used  for 
every  weighting  function  K  that  is  placed  on  top  of  each  sample  in  the  dataset.  Once  the 
width  h  is  determined,  it  remains  constant  throughout  the  rest  of  the  Parzen  Windows 
method;  hence,  the  width  h  is  derived  from  information  that  is  global  to  the  entire 
dataset,  vice  locally  adapting  to  the  sample  data.  This  makes  the  width  h  difficult  to 
determine  for  some  datasets.  For  those  troublesome  datasets,  h  may  be  too  wide  at 
certain  positions  in  the  dataset  and  too  narrow  at  other  positions  in  the  same  dataset  to 
accurately  estimate  the  true  density  of  the  sample  dataset. 

Parzen  Windows  using  diagonal  covariance  matrix  S  with  equal  variance  in  all 
dimensions  (i.e.,  "spherical"  multivariate  Normal)  is  the  kernel  density  estimate  that  will 
be  used  for  the  proposed  density  sensitive  distance  measurement. 
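
The spherical case can be sketched in a few lines of Python. The code assumes $S = \sigma^2 I$, so each kernel depends only on the squared Euclidean distance to its sample; the name `sigma`, the function name, and the toy data are assumptions made for illustration.

```python
import math

def spherical_parzen(x, samples, sigma):
    """Parzen Windows estimate with a spherical multivariate normal kernel."""
    n = len(x)
    # Normalizing constant of an n-dimensional normal with covariance sigma^2 * I.
    norm = (2.0 * math.pi) ** (n / 2.0) * sigma ** n
    total = 0.0
    for s in samples:
        sq = sum((a - b) ** 2 for a, b in zip(x, s))
        total += math.exp(-0.5 * sq / sigma ** 2)
    return total / (len(samples) * norm)

# Toy data: a tight cluster near the origin and one distant point.
samples = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (4.0, 4.0)]
near = spherical_parzen((0.0, 0.1), samples, sigma=0.5)
far = spherical_parzen((2.0, 2.0), samples, sigma=0.5)
print(near > far)
```

Because the same `sigma` is used for every sample, this sketch shares the limitation noted above: one global width may be too wide in dense regions and too narrow in sparse ones.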

2. Manifold Parzen Windows (Vincent & Bengio, 2002)


Manifold Parzen Windows assumes that a sample dataset is produced from a
lower dimensional manifold. Locally adapting weighting functions $K_i$ are used to
estimate that manifold. Manifold Parzen Windows infers the local direction of the
underlying manifold by calculating the eigenvalues (i.e., variances) and corresponding
eigenvectors (i.e., the directions of variance) associated with the covariance of a
neighborhood around each sample point. Larger variances are assumed to be associated
with the directions tangent to the manifold in order to account for the manifold. Smaller
variances are assumed to be associated with the directions normal to the manifold in
order to account for the noise off the manifold. A weighting function $K_i$ using that local
covariance matrix $S_i$ is then placed over the sample point $x_i$; hence, a different local
covariance matrix $S_i$ is used for each sample $x_i$.

For Manifold Parzen Windows, the neighborhood used to calculate the local
covariance matrix for a given sample point can be a hard $k$-neighborhood, vice a range.
In other words, we can compute the local covariance matrix associated with a sample
point $x_i$ by considering the $k$-nearest neighbors of that sample point. However, if $k$ is
less than the dimension of our data (i.e., $k < n$) or if the $k$-nearest neighbors do not span
the dimension of our data (i.e., the $k$-nearest neighbors exist in a subspace of our
$n$-dimensional dataset), then the resulting covariance matrix $S_i$ would be singular and not
invertible. Therefore, an epsilon $\varepsilon$ of variance is also added along each dimension in
order to guarantee non-singularity and invertibility.
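
The regularized local covariance can be sketched as follows with NumPy. This is an illustration under stated assumptions rather than the exact procedure of Vincent and Bengio: the covariance is centered on the sample point itself (centering on the neighborhood mean is another option), the data contain no duplicate points, and the names `local_covariances`, `k`, and `eps` are hypothetical.

```python
import numpy as np

def local_covariances(X, k, eps):
    """Covariance of each point's k nearest neighbors, plus eps on the diagonal."""
    m, n = X.shape
    covs = np.empty((m, n, n))
    for i in range(m):
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        nbrs = np.argsort(d2)[1:k + 1]   # index 0 is the point itself
        diffs = X[nbrs] - X[i]           # neighbors relative to the sample point
        covs[i] = diffs.T @ diffs / k + eps * np.eye(n)
    return covs

# Collinear points in 2-D: without eps every local covariance would be singular.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
covs = local_covariances(X, k=2, eps=0.01)
print(np.linalg.det(covs[0]) > 0.0)
```

The collinear example is the subspace case described above: the neighbors span only one direction, and the added `eps` is what keeps each matrix invertible.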

Lastly, we can reduce some of the computational complexity of Manifold Parzen
Windows by only considering the estimated dimension of the underlying manifold. The
dimension of the underlying manifold can be estimated by only considering the
eigenvectors associated with the largest eigenvalues of each local covariance matrix $S_i$.
These eigenvectors are estimates of the principal directions of the local manifold.
Moreover, the eigenvectors associated with the smallest eigenvalues of each local
covariance matrix $S_i$ are the directions of noise off the manifold. Therefore, we can
discard the eigenvectors of noise, retain our eigenvectors of principal direction, and
reduce our local covariance matrices $S_i$ down to a dimension that better accounts for the
manifold producing our data.

Therefore, a Manifold Parzen Window estimate of the probability density
function using a locally adapting multivariate normal weighting function $N_{\mu,S}$ is:

$$\hat{f}_m(x) = \frac{1}{m} \sum_{i=1}^{m} N_{\mu_i, S_i}(x)$$

where $m$ is the number of points in the sample dataset, $\hat{f}_m(x)$ is an estimate of the
probability density function $f(x)$ using $m$ samples, $\mu_i$ is the $i$-th point in the sample
dataset for $i = 1, \ldots, m$ (previously called $x_i$), $S_i$ is the $i$-th locally calculated
covariance matrix, and $N_{\mu,S}$ is the locally adapting multivariate normal weighting
function defined as

$$N_{\mu,S}(x) = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T S^{-1} (x - \mu)\right)$$

where $N_{\mu,S}$ is defined similarly to the multivariate normal weighting function in the
Parzen Windows section.
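
A short NumPy sketch of this estimate follows. It assumes the local covariance matrices have already been computed and regularized; the function names and the single-kernel example are illustrative only.

```python
import numpy as np

def gaussian_density(x, mu, S):
    """Multivariate normal density with mean mu and covariance S, evaluated at x."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(S, diff)   # diff^T S^{-1} diff without forming S^{-1}
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(S))
    return np.exp(-0.5 * quad) / norm

def manifold_parzen(x, mus, covs):
    """Average of one locally adapted normal kernel per sample point."""
    return np.mean([gaussian_density(x, mu, S) for mu, S in zip(mus, covs)])

# One kernel stretched along the first axis, standing in for a local tangent direction.
mus = [np.zeros(2)]
covs = [np.diag([4.0, 0.25])]
along = manifold_parzen(np.array([1.0, 0.0]), mus, covs)
across = manifold_parzen(np.array([0.0, 1.0]), mus, covs)
print(along > across)
```

The example shows the intended effect: a point displaced along the high-variance direction receives more density than a point displaced the same distance along the low-variance direction.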

Figure 3. An example of Manifold Parzen Windows.


Figure 3 shows an example of Manifold Parzen Windows from Vincent and
Bengio (2002). On the left, we have a sample dataset. On the right, we have an estimate
of the underlying probability density function using $N_{\mu,S}$ with local
covariance matrices calculated using the 10-nearest neighbors of each sample point.
Figure 3 should be compared to Figure 2 in order to see how Manifold Parzen Windows
differ from Parzen Windows.

For the proposed density sensitive distance measurement, we will use $k$-th
nearest neighbor distances to determine the bandwidth of the kernel used in the density
estimate of a dataset, similar to the way that Manifold Parzen Windows used the
$k$-nearest neighbors to determine each local covariance matrix.
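
The raw ingredient, the distance from each point to its $k$-th nearest neighbor, can be computed with a brute-force sketch such as the one below. How these distances are turned into a bandwidth is not specified in this section, so the code stops at the distances themselves; the function name and the data are assumptions.

```python
import math

def kth_nn_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, excluding itself."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

# Points on a line with growing gaps; sparser regions give larger distances.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
print(kth_nn_distances(points, k=2))
```

This brute-force version compares every pair of points; a spatial index would be the usual choice for larger datasets.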

B.  DISTANCE  MEASUREMENTS 

The proposed density sensitive distance measurement will be designed to be a
locally weighted Euclidean distance, one of the Minkowski distances. Moreover,
aspects of other distance measurements, namely the Mahalanobis distance and Wang et
al.'s Density Sensitive Distance Metric (Wang et al., 2006), will also impact this density
sensitive distance measurement.

1. Minkowski Distances


The Minkowski distance metrics include the Manhattan, Euclidean, and
Chebyshev distances (Zezula, 2006). The generic form of the Minkowski distance metric
is the following:

$$\text{distance}_p\left(x^{(0)}, x^{(f)}\right) = \left(\sum_{i=1}^{n} \left|x_i^{(f)} - x_i^{(0)}\right|^p\right)^{1/p}$$

where $p \in \mathbb{R}$ is the power of the metric, $x^{(0)}$ is the initial point (the source point), $x^{(f)}$ is
the final point (the destination point), and $n$ is the shared dimension of the points.

a. Manhattan Distance (City-Block Distance)

Manhattan distance takes the following form:

$$\text{distance}_1\left(x^{(0)}, x^{(f)}\right) = \left(\sum_{i=1}^{n} \left|x_i^{(f)} - x_i^{(0)}\right|^1\right)^{1/1} = \left|x_1^{(f)} - x_1^{(0)}\right| + \cdots + \left|x_n^{(f)} - x_n^{(0)}\right|$$

(Zezula, 2006) and has the unit circle detailed in Figure 4; hence, this metric is not
invariant to rotation (Samet, 2006).


Figure 4. The unit circle of Manhattan distance.
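
b. Euclidean Distance

Euclidean distance takes the following form:

$$\text{distance}_2\left(x^{(0)}, x^{(f)}\right) = \left(\sum_{i=1}^{n} \left|x_i^{(f)} - x_i^{(0)}\right|^2\right)^{1/2} = \sqrt{\left(x_1^{(f)} - x_1^{(0)}\right)^2 + \cdots + \left(x_n^{(f)} - x_n^{(0)}\right)^2} = \sqrt{\left(x^{(f)} - x^{(0)}\right)^T \left(x^{(f)} - x^{(0)}\right)}$$

(Zezula, 2006) and has the unit circle detailed in Figure 5; hence, Euclidean distance is
invariant to rotation (Samet, 2006).


Figure 5. The unit circle of Euclidean distance.


Moreover, since the linear interpolation from $x^{(0)}$ to $x^{(f)}$ is
$x_{0 \to f}(t) = (1 - t)\,x^{(0)} + (t)\,x^{(f)}$ where $0 \le t \le 1$ and $dx_{0 \to f}(t)/dt = x^{(f)} - x^{(0)}$, then
Euclidean distance can be rewritten as the following:

$$\text{distance}_2\left(x^{(0)}, x^{(f)}\right) = \int_0^1 \sqrt{\left(\frac{dx_{0 \to f}(t)}{dt}\right)^T \left(\frac{dx_{0 \to f}(t)}{dt}\right)}\,dt.$$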

c. Chebyshev Distance (Chessboard Distance)

Chebyshev distance has the following form:

$$\text{distance}_\infty\left(x^{(0)}, x^{(f)}\right) = \lim_{p \to \infty} \left(\sum_{i=1}^{n} \left|x_i^{(f)} - x_i^{(0)}\right|^p\right)^{1/p} = \max_{i} \left|x_i^{(f)} - x_i^{(0)}\right|$$

(Zezula, 2006) and has the unit circle detailed in Figure 6; hence, Chebyshev distance is
not invariant to rotation (Samet, 2006).


Figure 6. The unit circle of Chebyshev distance.


None of the Minkowski distances take the underlying dataset into account when
performing their metric; hence, the distance looks at the dataset uniformly. Moreover,
the  Minkowski  distances  do  not  offer  different  weights  to  different  pairs  of  points;  hence, 
these  Minkowski  distances  are  not  locally  weighted. 
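
The three Minkowski distances discussed above differ only in the power $p$, which a small Python sketch makes concrete. The function name and the example points are assumptions; the last lines rotate the destination point by 45 degrees to illustrate which of the three distances is invariant to rotation.

```python
import math

def minkowski(x0, xf, p):
    """Minkowski distance of power p; p = math.inf gives the Chebyshev distance."""
    if math.isinf(p):
        return max(abs(b - a) for a, b in zip(x0, xf))
    return sum(abs(b - a) ** p for a, b in zip(x0, xf)) ** (1.0 / p)

x0, xf = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x0, xf, 1)
euclidean = minkowski(x0, xf, 2)
chebyshev = minkowski(x0, xf, math.inf)

# Rotate the unit vector (1, 0) by 45 degrees about the origin.
c = math.sqrt(0.5)
before = [minkowski((0.0, 0.0), (1.0, 0.0), p) for p in (1, 2, math.inf)]
after = [minkowski((0.0, 0.0), (c, c), p) for p in (1, 2, math.inf)]
print(before, after)
```

Only the Euclidean value is unchanged by the rotation; the Manhattan distance grows and the Chebyshev distance shrinks. None of the three looks at any data other than the two endpoints.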

2. Mahalanobis Distance (Mahalanobis, 1936)

One reaction to the dataset-independent, unity-weighted Euclidean distance is
Mahalanobis distance. Mahalanobis distance takes into account the global covariance $S$
of a dataset and weights each distance based on this covariance. Mahalanobis distance
takes the following form:

$$\text{distance}_{\text{Mahalanobis}}\left(x^{(0)}, x^{(f)}\right) = \sqrt{\left(x^{(f)} - x^{(0)}\right)^T S^{-1} \left(x^{(f)} - x^{(0)}\right)}$$


Figure 7. The data dependent unit circles of Mahalanobis distance.


For  Mahalanobis  distance,  the  unit  for  the  unit  circle  is  one  standard  deviation  in 
the  direction  of  each  principal  component;  hence,  the  unit  circle  for  Mahalanobis  is  a  data 
dependent  ellipsoid  whose  radii  are  one  standard  deviation  in  each  of  the  principal 
component  directions. 

While  Mahalanobis  distance  is  data  dependent  and  takes  into  account  the  global 
variance  of  the  dataset,  it  does  not  take  into  account  the  local  densities  of  the  dataset. 
Moreover, Mahalanobis distance offers no advantage over Euclidean distance when the
global variances in each of the principal directions are equivalent (as in the left of Figure
7).
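
A minimal NumPy sketch of the Mahalanobis distance is given here, assuming the standard form $\sqrt{(x^{(f)} - x^{(0)})^T S^{-1} (x^{(f)} - x^{(0)})}$ with a global covariance $S$. The function name and the covariance values are illustrative assumptions.

```python
import numpy as np

def mahalanobis(x0, xf, S):
    """Mahalanobis distance between x0 and xf under a global covariance S."""
    diff = np.asarray(xf, dtype=float) - np.asarray(x0, dtype=float)
    # Solve S z = diff rather than forming the inverse of S explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

# Variance 4 along the first axis and 1 along the second.
S = np.diag([4.0, 1.0])
along_wide = mahalanobis((0.0, 0.0), (2.0, 0.0), S)     # one standard deviation
along_narrow = mahalanobis((0.0, 0.0), (0.0, 2.0), S)   # two standard deviations

# With equal variances the result matches the Euclidean distance.
same_as_euclidean = mahalanobis((0.0, 0.0), (3.0, 4.0), np.eye(2))
print(along_wide, along_narrow, same_as_euclidean)
```

The same Euclidean displacement of 2 is measured as 1 along the wide direction and 2 along the narrow one, and the identity covariance recovers the Euclidean distance, which is the equal-variance case described above.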

3. Density Sensitive Distance Metric (Manifold Distance) (Wang et al.,
2006)


The Density Sensitive Distance Metric of Ling Wang, Liefeng Bo, and Licheng
Jiao has the following form:

Let data points be the nodes of graph $G = (V, E)$ and $p \in V^{\ell}$ be a path of length $\ell = |p|$
connecting the initial point $x^{(0)}$ (the first node of the path) to the final point $x^{(f)}$ (the
last node of the path) in which $\left(x^{(k)}, x^{(k+1)}\right) \in E$ for $1 \le k < |p|$. Let $P_{x^{(0)}, x^{(f)}}$ denote
the set of all paths connecting the initial point $x^{(0)}$ to the final point $x^{(f)}$. The density
sensitive distance metric between two points is defined to be

$$\text{distance}_{\text{density sensitive}}\left(x^{(0)}, x^{(f)}\right) = \min_{p \in P_{x^{(0)}, x^{(f)}}} \sum_{k=1}^{|p|-1} \text{distance}_{\text{density adjusted length}}\left(x^{(k)}, x^{(k+1)}\right)$$

where the density adjusted length of a line segment is defined to be

$$\text{distance}_{\text{density adjusted length}}\left(x^{(k)}, x^{(k+1)}\right) = \rho^{\,\text{distance}_2\left(x^{(k)}, x^{(k+1)}\right)} - 1$$

where $\rho \in \mathbb{R}$ is the flexing factor for $\rho > 1$ and $\text{distance}_2\left(x^{(k)}, x^{(k+1)}\right)$ is Euclidean
distance between $x^{(k)}$ and $x^{(k+1)}$.

The length of the line segment between $x^{(k)}$ and $x^{(k+1)}$ can be scaled by adjusting
the flexing factor $\rho$. As detailed in their paper, "the density-sensitive distance metric can
measure the geodesic distance along the manifold, which results in any two points in the
same region of high density being connected by a lot of shorter edges while any two
points in different regions of high density are connected by a longer edge through a
region of low density" (Wang et al., 2006). Hence, the Density Sensitive Distance Metric
of Ling Wang, Liefeng Bo, and Licheng Jiao allows the distance along the path
afedcb to be shorter than the path ab (as in Figure 8), as opposed to Euclidean
distance.

Figure 8. af + fe + ed + dc + cb < ab (Wang et al., 2006)

Unfortunately,  the  Density  Sensitive  Distance  Metric  of  Ling  Wang,  Liefeng  Bo, 
and  Licheng  Jiao  has  a  high  computational  cost  since  it  assumes  a  complete  graph  over 
the  entire  dataset  and  then  computes  the  shortest  path  between  each  pair  of  points.  The 
Density  Sensitive  Distance  Metric  of  Ling  Wang,  Liefeng  Bo,  and  Licheng  Jiao 
accomplishes  everything  and  more  that  our  proposed  density  sensitive  distance 
measurement  attempts  to  accomplish;  however,  the  proposed  density  sensitive  distance 
measurement  will  attempt  to  reduce  the  computational  cost  of  the  Density  Sensitive 
Distance  Metric  of  Ling  Wang,  Liefeng  Bo,  and  Licheng  Jiao  (i.e.,  the  cost  of  calculating 
the  shortest  path  in  the  complete  graph)  by  restricting  our  measure  to  the  straight  line 

path  from  the  initial  point  to  the  final  point  while  traveling  over  a  kernel  density 
estimation. 
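
To make that computational cost concrete, the sketch below builds the complete graph with density adjusted edge lengths $\rho^{d} - 1$ and runs the Floyd-Warshall all-pairs shortest path algorithm over it. Floyd-Warshall is our choice for illustration, not necessarily the algorithm used by Wang et al.; the function name and the data are assumptions.

```python
import math

def density_sensitive_distances(points, rho):
    """All-pairs shortest paths over a complete graph with edges rho**d - 1."""
    m = len(points)
    # Density adjusted length of every edge in the complete graph.
    D = [[rho ** math.dist(p, q) - 1.0 for q in points] for p in points]
    # Floyd-Warshall: allow each node k in turn as an intermediate stop.
    for k in range(m):
        for i in range(m):
            for j in range(m):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Four evenly spaced points; hopping through neighbors beats the direct edge.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
rho = 2.0
direct = rho ** math.dist(points[0], points[3]) - 1.0
via_path = density_sensitive_distances(points, rho)[0][3]
print(direct, via_path)
```

The triple loop costs on the order of $m^3$ operations for $m$ data points, in addition to the $m^2$ edge lengths, which is the expense that restricting the measurement to the straight line path is meant to avoid.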

C.  PRINCIPAL  COMPONENT  ANALYSIS  (PCA) 

Principal  Component  Analysis  (PCA)  is  the  linear  projection  that  minimizes  the 
mean  squared  distance  between  data  points  and  their  projections  (Bishop,  2007). 
Principal  Component  Analysis  decomposes  a  dataset  (usually  through  the  dataset's 
covariance  matrix)  into  a  set  of  eigenvalues  and  eigenvectors  that  represents  the 
directions  of  highest  to  lowest  variance  along  an  orthonormal  basis  where  the  principal 
eigenvector points in the direction of the highest variance and all other eigenvectors are
orthogonal and point in the directions of the next highest variance.

For this work, PCA will be used to extract the maximum "lateral" variance (i.e.,
the maximum eigenvalue of the covariance matrix $S$) for each dataset in order to
determine a scale $\gamma$ applied to a kernel density estimate.
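
Extracting the maximum eigenvalue of the covariance matrix takes only a few lines with NumPy. The sketch below stops at the eigenvalue, because the mapping from that value to the scale $\gamma$ is not given in this section; the function name and the data are assumptions.

```python
import numpy as np

def max_lateral_variance(X):
    """Largest eigenvalue of the sample covariance matrix of X (rows are samples)."""
    S = np.cov(X, rowvar=False)              # columns are treated as variables
    return float(np.linalg.eigvalsh(S)[-1])  # eigvalsh returns ascending eigenvalues

# Data spread more widely along the first axis than the second.
X = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
print(max_lateral_variance(X))
```

Note that `np.cov` divides by $m - 1$ by default, so the value for this example is $8/3$ rather than $2$.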

III. DENSITY SENSITIVE DISTANCE MEASUREMENT


A. DEFINITION


Let $x_{0 \to f}(t) = (1 - t)\,x^{(0)} + (t)\,x^{(f)}$ be the linear interpolation from an initial point
$x^{(0)}$ (the source point) to a final point $x^{(f)}$ (the destination point) in $n$-dimensions and

$$y(x) = \gamma \sum_{i=1}^{m} K_i(x)$$

be the scaled kernel density estimate of the dataset over which
distances will be measured where $\gamma \in \mathbb{R}$ is the scale (or gain) of $y$, $m$ is the number of
data points in the dataset, and $K_i(x): \mathbb{R}^n \to \mathbb{R}$ is the kernel function centered at the $i$-th
data point in that dataset (i.e., $y$ is the sum of kernel values at $x$ where a kernel is placed
at every data point in the entire dataset). Then, the density sensitive distance
measurement we are proposing is

$$\text{distance}\left(x^{(0)}, x^{(f)}\right) = \int_0^1 \sqrt{\left(\frac{dx_{0 \to f}(t)}{dt}\right)^T \left(\frac{dx_{0 \to f}(t)}{dt}\right) + \left(\frac{dy\left(x_{0 \to f}(t)\right)}{dt}\right)^2}\,dt$$

where $dx_{0 \to f}(t)/dt = x^{(f)} - x^{(0)}$ is the derivative of the linear interpolation with respect
to $t$ and $dy\left(x_{0 \to f}(t)\right)/dt$ is the derivative of $y(x)$ as it travels from the initial point $x^{(0)}$
to the final point $x^{(f)}$.

The kernel function that will be used in this work is the probability density
function for the spherical multivariate Normal distribution given by:

$$K_i(x) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\left(-\frac{1}{2} \sum_{j=1}^{n} \left(\frac{x_j - x_j^{(i)}}{\sigma}\right)^2\right)$$

where $\sigma$ is the radius of the sphere which is covered by the kernel (detailed in Appendix
B), $x^{(i)}$ is the $i$-th data point in the dataset with components $x_j^{(i)}$ for $j = 1, \ldots, n$, and $n$
is the dimension of the dataset.

Therefore, this density sensitive distance measurement is the line integral from $x^{(0)}$ to $x^{(f)}$ as it travels along the surface of the scaled kernel density estimation of a dataset. Each dataset (or data subset) will most likely have a different kernel density estimation, and the line integral from $x^{(0)}$ to $x^{(f)}$ will be sensitive to the local density of the data over which it is measured.
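The definition above can be made concrete with a small numerical sketch. It is not the implementation used in this work: it approximates the line integral with a fixed number of equal steps (the `steps` parameter is an assumption introduced here), and the function names `scaled_kde` and `density_sensitive_distance` are hypothetical. It assumes NumPy and a dataset small enough that a steps-by-m-by-n array fits in memory.

```python
import numpy as np

def scaled_kde(points, data, sigma, gamma):
    """Evaluate y(x) = gamma * sum_i K_i(x) at each row of `points`."""
    points = np.atleast_2d(points)
    n = data.shape[1]
    # Squared Euclidean distance from every evaluation point to every kernel center.
    sq = ((points[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    # Normalizing constant of the spherical multivariate Normal density.
    norm = (2.0 * np.pi) ** (n / 2.0) * sigma ** n
    return gamma * np.exp(-0.5 * sq / sigma ** 2).sum(axis=1) / norm

def density_sensitive_distance(x0, xf, data, sigma, gamma, steps=2000):
    """Fixed-step approximation of the line integral from x0 to xf."""
    x0 = np.asarray(x0, dtype=float)
    xf = np.asarray(xf, dtype=float)
    t = np.linspace(0.0, 1.0, steps + 1)[:, None]
    path = x0 + t * (xf - x0)              # x(t) = (1 - t) x0 + t xf
    y = scaled_kde(path, data, sigma, gamma)
    dx = np.linalg.norm(xf - x0) / steps   # lateral length of each sub-segment
    dy = np.diff(y)                        # change in height over each sub-segment
    return float(np.sqrt(dx ** 2 + dy ** 2).sum())
```

With `gamma = 0` the surface is flat and the result reduces to the Euclidean distance between the two points; any positive scale can only lengthen the path.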

B. PURPOSE

The  purpose  of  the  density  sensitive  distance  measurement  is  to  take  into  account 
the  density  of  a  set  of  data  when  determining  how  similar  any  given  point  is  to  the 
dataset.  If  the  set  of  data  is  highly  concentrated,  then  a  point  that  is  part  of  that  set  should 
be  at  locations  that  mimic  that  concentration  or  else  a  penalty  should  be  incurred. 
Similarly,  if  a  set  of  data  is  greatly  dispersed,  then  a  point  that  is  part  of  that  set  should 
also  be  at  positions  that  imitate  that  level  of  dispersion  or  a  similar  price  should  be  paid. 

C.  PARAMETERS 

Based  on  the  definition  of  this  density  sensitive  distance  measurement,  there  are 
two  parameters  that  need  to  be  determined  before  a  measurement  can  be  taken:  the 
kernel bandwidth and the scale. For the kernel selected, the parameter that determines
the kernel bandwidth is $\sigma$. The scale is determined by $\gamma$.

1.  Kernel  Bandwidth 

There are many ways to determine the optimal kernel bandwidth when the distribution of a dataset is known or suspected. However, when the underlying distribution that generates the dataset is unknown, determining the bandwidth of the kernel becomes a matter of perspective. We are free to choose a small bandwidth to show roughness in the data. We are also free to choose a large bandwidth to show smoothness in that same data. In other words, when we do not know what the underlying distribution is, then there is nothing for us to optimize against and we are free to choose the kernel bandwidth that best biases our results.

This is analogous to viewing a painting in a gallery. When the painter is not present to actively form the opinion of a patron by telling the patron where to stand and what to look for, then the patron must form his/her own opinion of the work. Some patrons may choose to stand close to the painting to view the detail of each brush stroke. Some patrons may stand back to view the entire work as a whole without delving into any of its detail. And some may search for a happy medium between the two. Almost all the patrons will assign meaning to some portion of the work that was unintended by the painter. However, the perspective of every patron is valid even though certain perspectives may conflict. This freedom of perspective allows patrons to see what they want to see.

Likewise, determining the kernel bandwidth of an unknown distribution is an exercise in perspective. Since we do not know if we have enough data to absolutely assert one distribution over another and since we cannot be completely certain that the available data is a true random sample from the greater population (since we do not know the greater population), the kernel bandwidth can be reduced to a matter of perspective. For our purposes, we know the kernel we are using, the probability density function of the spherical multivariate Normal distribution, has a strong additive effect when the centers of two or more of these kernels are within $2\sigma$ of each other (as in Figure 9). This additive effect causes the overall kernel density estimation to appear smooth. In this case, a kernel density estimate appears smooth when the number of extrema in the estimate is reduced to the relevant extrema, the extrema that best conform to the density we want to see. However, if too many kernel centers are within $2\sigma$ of each other, then there is too much of an additive effect and the resulting kernel density estimation is overly smooth (as in Figure 10). In this case, a kernel density estimate is overly smooth when extrema, believed to conform to the density we want to see, are eliminated by the additive effect of the kernels.
Figure 9. The additive effect that smooths the kernel density estimation when kernel centers are within $2\sigma$ of each other. Here, $\sigma = 1$.

Figure 10. Over-smoothing the kernel density estimation when too many kernel centers are within $2\sigma$ of each other. Here, $\sigma = 2$.

Since we arbitrarily desire the kernel density estimation to appear smooth, but not too smooth, we will take advantage of this additive effect and choose a $\sigma$ that causes groups of centers to be within $\sigma$ of each other. To avoid over-smoothing, we collect the distance to the $k$-th nearest neighbor of each datum in the dataset, bin these distances, and choose the $\sigma$ that corresponds to the middle distance associated with the bin that holds the maximum number of these $k$-th nearest neighbor distances. If we choose $k$ to be small compared to the size of a dataset, then our $\sigma$ will be small as well and the kernel density estimation will be rougher. If we choose $k$ to be large compared to the size of a dataset, then our $\sigma$ will be large as well and the kernel density estimation will be smoother. We found that starting with a $k$ that is approximately 15% of the size of the dataset yields acceptable results during cross-validation when maximizing accuracy, precision, recall, or various combinations.

Since collecting the pair-wise distances can be quite expensive when the dataset is large, we can treat the distances between each datum in a dataset as a population. Moreover, since we know and have access to the entire population of pair-wise distances, we can randomly sample these distances in order to come up with an acceptable kernel bandwidth. For this work, when a dataset contains over 1000 data points, we first randomly sample up to 1000 data points from that dataset, find the pair-wise distances between those randomly sampled points, and complete the routine described above for avoiding over-smoothing on this random sample.
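The bandwidth routine described above can be sketched as follows. This is only an illustration under stated assumptions: the number of histogram bins (`bins=20`), the random seed, and the function name `select_bandwidth` are not given in the text, and `k` is interpreted relative to the (possibly subsampled) set of points. NumPy is assumed.

```python
import numpy as np

def select_bandwidth(data, k, bins=20, max_points=1000, seed=0):
    """Pick sigma as the midpoint of the fullest bin of k-th nearest neighbor distances."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Subsample large datasets so that the pair-wise distance matrix stays small.
    if len(data) > max_points:
        keep = rng.choice(len(data), size=max_points, replace=False)
        data = data[keep]
    # Pair-wise Euclidean distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (data ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    dist = np.sqrt(np.maximum(sq, 0.0))
    np.fill_diagonal(dist, 0.0)
    # After sorting each row, column 0 is the point itself and column k is
    # the distance to its k-th nearest neighbor (requires k < number of points).
    kth = np.sort(dist, axis=1)[:, k]
    counts, edges = np.histogram(kth, bins=bins)
    fullest = int(np.argmax(counts))
    return 0.5 * (edges[fullest] + edges[fullest + 1])
```

A smaller `k` yields a smaller bandwidth and a rougher estimate; a larger `k` yields a larger bandwidth and a smoother one, matching the behavior described in the text.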

Note: As $\sigma \to \infty$, the kernel density estimate approaches a flat plane and this density sensitive distance measurement approaches Euclidean distance.

2.  Scale 

As with kernel bandwidth, there are many ways to determine scale. Scale $\gamma$ affects the amplitude (or gain) of the kernel density estimation for a dataset (as in Figure 11). If $\gamma > 1$, then the kernel density estimate will be amplified (i.e., the gain will be turned up) and the existing dataset will produce $y$-values with a higher variance (as in Figure 12). If $0 < \gamma < 1$, then the amplitude of the kernel density estimate will be reduced (i.e., the gain will be turned down) and the existing dataset will produce $y$-values with less variance (as in Figure 13).
Figure 11. The kernel density estimation with $\gamma = 1$ for two separate datasets.

Figure 12. Doubling the scale associated with a dataset. Left, $\gamma = 1$ for the blue kernel density estimate and $\gamma = 2$ for the red kernel density estimate. Right, $\gamma = 2$ for the blue kernel density estimate and $\gamma = 1$ for the red kernel density estimate.

Figure 13. Halving the scale associated with a dataset. Left, $\gamma = 1$ for the blue kernel density estimate and $\gamma = 1/2$ for the red kernel density estimate. Right, $\gamma = 1/2$ for the blue kernel density estimate and $\gamma = 1$ for the red kernel density estimate.

Moreover, $\gamma$ can be used to change the variance of the $y$-values of a dataset. Since we know that variance can be estimated by the following equation:

$$s_y^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2$$

then $\gamma$ can change the variance estimate of the $y$-values by the following:
The line integral for this density sensitive distance measurement can be implemented using at least two methods: local adaptive quadrature on the integrand (Burden & Faires, 2005) or local adaptive Euclidean distance on the scaled kernel density estimation. Local adaptive quadrature can be faster; however, the derivative of $y(x(t))$ must be calculated. If that is not desirable or even possible, then we can use local adaptive Euclidean distance directly on the scaled kernel density estimation (as in Figure 14). For local adaptive Euclidean distance, we simply choose various points along the path from $x^{(0)}$ to $x^{(f)}$, calculate their respective scaled kernel density estimations, and measure the Euclidean distance from point to point. Then we divide each implied line segment in two and take the Euclidean distance of those two new segments. We iterate until the change in distance between the whole segment and the two half segments is less than an established threshold. When the distances of all the segments have been calculated, the sum of all those segment distances approximates the line integral distance.


Figure 14. Successive iterations of local adaptive Euclidean distance on the scaled kernel density estimate from $-4$ to $4$ in order to approximate the line integral from $-4$ to $4$.
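The local adaptive Euclidean distance routine can be sketched for a one-dimensional path, as in Figure 14. This is a minimal sketch, not the code used in this work: the function name `adaptive_arc_length`, the default tolerance, the number of initial segments, and the recursion limit are all assumptions chosen for illustration. Here `y` is any callable returning the height of the scaled kernel density estimate at position `t`.

```python
import math

def adaptive_arc_length(y, a, b, tol=1e-8, initial_segments=8, max_depth=30):
    """Length of the curve (t, y(t)) on [a, b] by recursively halving chords."""
    def chord(t0, y0, t1, y1):
        return math.hypot(t1 - t0, y1 - y0)

    def refine(t0, y0, t1, y1, depth):
        tm = 0.5 * (t0 + t1)
        ym = y(tm)
        whole = chord(t0, y0, t1, y1)
        halves = chord(t0, y0, tm, ym) + chord(tm, ym, t1, y1)
        # Stop when splitting the segment no longer changes its length by
        # more than the threshold (or when the depth limit is reached).
        if depth >= max_depth or abs(halves - whole) < tol:
            return halves
        return (refine(t0, y0, tm, ym, depth + 1)
                + refine(tm, ym, t1, y1, depth + 1))

    # Start from several evenly spaced points along the path.
    ts = [a + (b - a) * i / initial_segments for i in range(initial_segments + 1)]
    ys = [y(t) for t in ts]
    return sum(refine(ts[i], ys[i], ts[i + 1], ys[i + 1], 0)
               for i in range(initial_segments))
```

Starting from several initial points rather than a single segment is a design choice made here: it lowers the chance that the stopping test passes on a segment whose midpoint happens to lie on the chord while the curve deviates elsewhere.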
E. STRENGTHS


The main strength of this density sensitive distance measurement is that it takes into account the density of the dataset over which it measures. In so doing, the density sensitive distance measurement provides a more shape-conforming distance for classification than Euclidean distance alone. For instance, if we only look to the immediate nearest neighbor using Euclidean distance for classification, then this is equivalent to classifying a point based on its location in the Voronoi diagram. For the circular datasets used in the Introduction, we would have a star shaped pattern for classification instead of a circular one (as in Figure 15 and Figure 16). However, if we let $v_{\text{blue } y \text{ variance}}$ be the $y$-value variance for the blue class, $v_{\text{red } y \text{ variance}}$ be the $y$-value variance for the red class, and $v_{\text{blue max lateral variance}}$ and $v_{\text{red max lateral variance}}$ be the maximum lateral variances for the blue and red classes, respectively, then we can look to the immediate nearest neighbor using the proposed density sensitive distance for classification and achieve much more shape-conforming results (as in Figure 17 and Figure 18).


Figure 15. The Voronoi diagram and 1-nearest neighbor classification using Euclidean distance on the datasets from the Introduction.

Figure 16. Close up of 1-nearest neighbor classification using Euclidean distance on the datasets from the Introduction.

Figure 17. 1-nearest neighbor classification using density sensitive distance with

Figure 18. 1-nearest neighbor classification using density sensitive distance with
on the datasets from the Introduction.


F.  WEAKNESSES 

The density sensitive distance measurement is not a distance metric. In order for a measurement to be considered a metric, the triangle inequality must hold (Zezula, 2006). In other words, to be a metric, the following property must hold:

$$\forall x, y, z \in S, \quad \text{distance}(x, z) \le \text{distance}(x, y) + \text{distance}(y, z)$$

The triangle inequality does not hold for this density sensitive distance measurement. For instance, given $(-4, -4)$, $(4, -4)$, and $(4, 4)$ and the scaled kernel density estimate in Figure 19, we would have the following:

$$\text{distance}\big((-4,-4),(4,4)\big) > \text{distance}\big((-4,-4),(4,-4)\big) + \text{distance}\big((4,-4),(4,4)\big)$$

for density sensitive distance. Hence, the triangle inequality does not hold for this density sensitive distance measure. Therefore, the density sensitive distance measurement is a measurement, not a metric.
Figure 19. Different perspective views on the same scaled kernel density estimation that demonstrate that the triangle inequality does not hold for this density sensitive distance measurement.
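A violation of the triangle inequality is easy to reproduce numerically. The sketch below does not use the kernel density estimate of Figure 19; it assumes a hypothetical surface made of a single kernel at the origin with $\sigma = 1$ and $\gamma = 100$, and it uses a fixed-step approximation of the line integral (the helper `dsd` and its `steps` parameter are introduced here only for illustration). NumPy is assumed.

```python
import numpy as np

def dsd(x0, xf, centers, sigma, gamma, steps=4000):
    """Fixed-step approximation of the density sensitive distance."""
    x0, xf = np.asarray(x0, float), np.asarray(xf, float)
    t = np.linspace(0.0, 1.0, steps + 1)[:, None]
    path = x0 + t * (xf - x0)                       # straight path in the plane
    sq = ((path[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    n = centers.shape[1]
    norm = (2.0 * np.pi) ** (n / 2.0) * sigma ** n
    y = gamma * np.exp(-0.5 * sq / sigma ** 2).sum(axis=1) / norm
    dx = np.linalg.norm(xf - x0) / steps
    return float(np.sqrt(dx ** 2 + np.diff(y) ** 2).sum())

centers = np.array([[0.0, 0.0]])                    # one kernel at the origin
a, b, c = (-4.0, -4.0), (4.0, -4.0), (4.0, 4.0)

# The direct path crosses the peak; the detour stays on nearly flat ground.
direct = dsd(a, c, centers, sigma=1.0, gamma=100.0)
detour = (dsd(a, b, centers, sigma=1.0, gamma=100.0)
          + dsd(b, c, centers, sigma=1.0, gamma=100.0))
print(direct > detour)  # True
```

The direct path must climb and descend a peak of height roughly $100 / (2\pi)$, while both legs of the detour pass at least four bandwidths from the kernel center, so the detour is barely longer than its lateral length of 16.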


Additionally,  if  we  implement  this  density  sensitive  distance  measurement  using 
either  local  adaptive  quadrature  or  local  adaptive  Euclidean  distance,  then  we  need  to  be 
conscious  of  the  fact  that  a  poor  choice  in  where  a  line  segment  is  broken  can  lead  to 
incorrect  results  when  calculating  the  line  integral  (as  in  Figure  20).  If  a  line  segment  is 
broken  and  the  difference  between  the  length  of  the  original  line  and  the  lengths  of  the 
resulting  two  line  segments  is  under  a  threshold,  then  locally  adaptive  routines  assume 
they  have  adapted  to  their  goal  with  a  specified  tolerance.  If  they  have  not  properly 
adapted,  then  the  locally  adaptive  routine  will  return  incorrect  results  for  the  line  integral. 
Figure 20. A poor choice for the break in a line segment that will stop further locally adaptive line segments from being generated in the computation of the line integral from $-4$ to $4$.

Also, even though improvements have been made in shape-conforming classification, the potential exists to get odd classification results from certain combinations of kernel bandwidth and scale (as in Figure 21). In Figure 21, if a datum falls in line with regions that we would usually consider to be blue or red, then everything is as expected; however, if a datum is an outlier and falls far enough out of the traditional boundaries, then the datum is classified as red, even though blue is closer from a Euclidean distance point of view.
Figure 21. 1-nearest neighbor classification using density sensitive distance with
on the artificial dataset.


Note that as odd as the results of Figure 21 are, we will retain its choice of $\gamma$ for classification later in this work.

Lastly,  since  the  kernel  density  estimate  used  by  the  proposed  density  sensitive 
distance  measurement  is  non-parametric,  the  entire  dataset  must  be  retained  and 
repeatedly  iterated  over  for  each  measurement,  not  just  representative  samples  from  that 
dataset.  Therefore,  there  is  a  high  computational  cost  to  the  proposed  density  sensitive 
distance  measurement. 
IV. METHODOLOGY


To test the utility of the proposed density sensitive distance measurement, we use this distance measurement to perform supervised learning. For k-Nearest Neighbor classification using the density sensitive distance measurement, we train and test on two real-world datasets. We compare this to k-Nearest Neighbor classification using Euclidean distance. Lastly, to put both of these results into context, we perform Support Vector Machine and Random Forests classification to see how well classification using this density sensitive distance measurement stands up against modern supervised learning algorithms. For all classifiers, we use stratified 10-fold cross validation to obtain the generalization error of each classifier, record the overall accuracy and error rate, and document the precision and recall for each class.

A. DATASETS

We  test  the  proposed  density  sensitive  distance  measurement  on  two  datasets  -  the 
entire  Wisconsin  Diagnostic  Breast  Cancer  (WDBC)  dataset  and  a  portion  of  the  MNIST 
Database  of  Handwritten  Digits. 

1.  The  Wisconsin  Diagnostic  Breast  Cancer  (WDBC)  Dataset 

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset is from the University of California, Irvine, repository (UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set). This is a small multivariate dataset with 569 data points, where each datum consists of 30 real-valued components. Each datum is constructed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Its components describe characteristics of the cell nuclei present in the image. Ten real-valued components are computed for each cell nucleus:

1)  radius  (mean  of  distances  from  center  to  points  on  the  perimeter), 

2)  texture  (standard  deviation  of  gray-scale  values), 

3) perimeter,

4) area,

5) smoothness (local variation in radius lengths),

6) compactness (perimeter^2 / area - 1),

7) concavity (severity of concave portions of the contour),

8) concave points (number of concave portions of the contour),

9) symmetry, and

10) fractal dimension ("coastline approximation" - 1).

The mean, standard error, and "worst" or largest (mean of the three largest values) of these components were computed for each image, resulting in 30 total components. For instance, the first component is the mean of the radius, the 11th component is the standard error of the radius, and the 21st component is the worst radius.

Of the 569 data points, 357 represent benign tumors and 212 represent malignant ones.

2.  The  MNIST  Database  of  Handwritten  Digits 

The MNIST Database of Handwritten Digits is a subset of a larger database available from the National Institute of Standards and Technology (NIST) (MNIST Handwritten Digit Database, Yann LeCun and Corinna Cortes). The MNIST database was constructed from NIST's Special Database 1 and Special Database 3, which contain binary images of handwritten digits. For the MNIST database, the original binary images from the NIST databases were size normalized to fit in 20 x 20 pixel windows while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. Each 20 x 20 pixel image was centered in a 28 x 28 pixel window by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28 x 28 pixel window.
The MNIST database consists of two sets of images - a training set of 60,000 images and a testing set of 10,000 images. The 60,000 pattern training set contains examples from approximately 250 different writers.

For this work, we only use the handwritten ones, twos, and threes from the MNIST training set. This is a medium sized multivariate dataset with 18,831 data points, where each datum consists of 784 integer-valued components with integers ranging from 0 to 255. Of the 18,831 data points, 6,742 represent handwritten ones, 5,958 represent handwritten twos, and 6,131 represent handwritten threes. Examples of the handwritten ones, twos, and threes are shown in original and enlarged sizes in Figure 22, Figure 23, and Figure 24.


Figure 22. An example of a handwritten one from the MNIST training dataset. Left, the original size of the example. Right, the enlarged size.

Figure 23. An example of a handwritten two from the MNIST training dataset. Left, the original size of the example. Right, the enlarged size.

Figure 24. An example of a handwritten three from the MNIST training dataset. Left, the original size of the example. Right, the enlarged size.
B. SUPERVISED LEARNING

Supervised learning is learning in which an algorithm receives a set of input
data and their corresponding output data (i.e., a training dataset), trains on this data to
find  a  function  of  the  input  data  that  approximates  the  known  output  data,  and  then  uses 
that  trained  function  on  unknown  input  data  (i.e.,  on  a  testing  dataset)  (Izenman,  2008). 
The  input  data  may  contain  continuous  or  categorical  values.  For  classification,  the 
output  data  consists  of  categorical  values,  usually  called  labels.  The  goal  of  the  learning 
algorithm  for  classification  is  to  minimize  the  error  incurred  during  the  testing  phase 
while  only  training  on  the  training  dataset.  In  essence,  supervised  learning  is  analogous 
to  classroom  instruction.  A  teacher  presents  each  student  with  a  set  of  various  problems 
and  their  respective  correct  answers,  the  student  then  conceptualizes  those  problems  and 
their  respective  answers,  and  finally,  the  student  is  tested  on  previously  unseen  problems 
and  a  grade  is  recorded. 

C.  CLASSIFICATION  ALGORITHMS 

For this work, we use the following classification algorithms: k-Nearest Neighbor classification using the proposed density sensitive distance measurement, k-Nearest Neighbor classification using Euclidean distance, Support Vector Machine, and Random Forests.

1. k-Nearest Neighbor Classification

k-Nearest Neighbor classification is classification of a testing datum based on the majority vote of the class labels of the k most similar training data. For this work, similarity will be determined by our density sensitive distance measurement and by Euclidean distance.

Although 1-Nearest Neighbor is a sub-optimal procedure, its error rate can be bounded from above. Given an unlimited amount of training data, 1-Nearest Neighbor classification has an error rate guaranteed to be no worse than twice the Bayes error rate, the minimum possible error rate (Duda, Hart, & Stork, 2000).
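The majority-vote rule can be sketched with the distance left as a parameter, so that either Euclidean distance or a density sensitive distance can be plugged in. The function names below are hypothetical, and the tie-breaking behavior (the tied label that appears first among the nearest neighbors wins) is an assumption of this sketch rather than something the text specifies.

```python
from collections import Counter
import math

def knn_classify(query, train_points, train_labels, k, distance):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training indices by their distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: distance(query, train_points[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common keeps first-seen order among equal counts, so ties go to
    # the label encountered first among the nearest neighbors.
    return votes.most_common(1)[0][0]

def euclidean(p, q):
    return math.dist(p, q)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["blue", "blue", "red", "red", "red"]
print(knn_classify((0.2, 0.2), points, labels, k=3, distance=euclidean))  # blue
```

Any callable taking two points and returning a non-negative number can be passed as `distance`; the classifier does not require it to satisfy the triangle inequality.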
2. Support Vector Machine Classification


Support Vector Machine classification projects training data into a higher-dimensional space by creating new dimensions from combinations of the original dimensions and then finds the hyperplane that best separates the classes of data in that higher-dimension (Bradski & Kaehler, 2008). Kernel functions, such as the polynomial kernel or the radial basis function kernel, are used to create those new dimensions from combinations of original dimensions. During testing, incoming data are projected into
the  higher-dimension  using  the  kernel  function,  an  inner-product  is  taken  based  on  the 
normal  vector  of  the  class-separating  hyperplane,  and  the  sign  of  that  inner  product 
determines  the  classification  of  the  data.  For  multiple-class  classification  problems, 
multiple  hyperplanes  can  be  constructed.  For  example,  in  a  three  class  classification 
problem,  the  first  hyperplane  can  separate  class  1  data  from  non-class  1  data  (i.e.,  class  2 
&  3  data).  The  second  hyperplane  can  separate  class  2  data  from  non-class  2  data  (i.e., 
class  3  data).  During  the  testing  phase,  a  datum  that  is  initially  classified  as  a  non-class  1 
datum  need  only  look  to  the  second  separating  hyperplane  to  determine  if  that  datum 
should  be  classified  as  class  2  or  class  3. 

For  this  work,  we  use  the  Support  Vector  Machine  implementation  in  OpenCV 
1.1  (OpenCV  1.1  2008). 

3. Random Forests Classification

Random Forests (Random Forests 2009) classification randomly constructs multiple decision trees based on training data. During testing, each tree votes on the classification of each incoming datum. Each incoming datum is assigned the class with the most votes. Below we give a brief description of Random Forests. For a more detailed explanation, refer to (Random Forests 2009).

In  Random  Forests,  each  tree  is  constructed  as  follows: 

1)  If  the  size  of  the  training  set  is  m ,  then  sample  m  cases  at  random  with 
replacement  and  grow  the  tree  from  this  sample  set. 
2) If there are n components, a number k << n is specified such that at each node, k of n components are selected at random and the best split on these k is used to split the node. The value of k is held constant during construction of the entire forest.

3)  No  tree  is  pruned  and  each  tree  is  grown  to  the  largest  extent  possible. 
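The two sources of randomness in the steps above can be sketched in a few lines. This is not the OpenCV implementation used in this work; the helper names are hypothetical and only the sampling is shown, not the tree growing itself.

```python
import random

def bootstrap_sample(m, rng):
    """Step 1: draw m training-case indices at random with replacement."""
    return [rng.randrange(m) for _ in range(m)]

def candidate_components(n, k, rng):
    """Step 2: at a node, pick k of the n components without replacement."""
    return rng.sample(range(n), k)

rng = random.Random(0)
rows = bootstrap_sample(569, rng)          # cases used to grow one tree
cols = candidate_components(30, 5, rng)    # components considered at one node
```

Because the bootstrap sample is drawn with replacement, some cases appear more than once and others not at all, which is one reason the trees differ from one another.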

The  Random  Forest  error  rate  depends  on  two  things:  1)  the  correlation  between 
any  two  trees  in  the  forest  and  2)  the  strength  of  each  individual  tree  in  the  forest. 
Increasing  the  correlation  between  trees  increases  the  error  rate.  Increasing  the  strength 
of  the  individual  trees  decreases  the  error  rate.  Reducing  k  reduces  both  the  correlation 
and  the  strength.  Increasing  k  increases  correlation  and  strength.  Hence,  we  need  to 
find  the  k  that  optimizes  these  parameters. 

For  this  work,  we  use  the  Random  Forests  implementation  in  OpenCV  1.1 
(OpenCV  1.1  2008). 

D. STRATIFIED 10-FOLD CROSS VALIDATION

For  this  work,  the  classification  algorithms  are  trained  and  tested  using  stratified 
10-fold  cross  validation. 

For  c  different  classes,  we  separate  the  original  data  set  into  c  class  data  sets 
where  each  class  data  set  only  contains  data  with  the  same  class  label;  in  other  words,  the 
data  in  each  of  these  class  data  sets  are  from  the  same  class. 

Over  10  iterations,  we  then  separate  each  class  data  set  into  two  different  sets  -  a 
class  training  set  and  a  class  testing  set.  For  the  first  iteration,  the  first  10%  of  each  class 
data  set  is  used  for  testing  and  the  remaining  90%  is  used  for  training.  For  the  second 
iteration,  the  next  10%  of  each  class  data  set  is  used  for  testing  and  the  remaining  90%  is 
used for training. The data in the third through the tenth iterations are divided similarly so that
all of the data in the class data sets are used for testing during exactly one of the
iterations.

Cross  validation  is  used  to  estimate  prediction  error  (Hastie,  Tibshirani,  & 
Friedman,  2001).  Cross  validation  directly  estimates  the  generalization  error  when  a 

classification  algorithm  is  applied  to  an  independent  sample  testing  dataset. 
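The splitting procedure described in this section can be sketched as a generator of training and testing index lists. The function name is hypothetical, and the use of integer boundaries for class sizes that are not divisible by 10 is an assumption of this sketch; the text does not say how such remainders are handled.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Yield (train_indices, test_indices) for each fold, class by class."""
    # Separate the data set into one index list per class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for fold in range(n_folds):
        test = []
        for indices in by_class.values():
            size = len(indices)
            # Consecutive slices: fold 0 takes the first tenth of each class,
            # fold 1 the next tenth, and so on.
            start = fold * size // n_folds
            stop = (fold + 1) * size // n_folds
            test.extend(indices[start:stop])
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, test
```

Every index appears in exactly one testing set, and each testing set keeps approximately the class proportions of the full dataset.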
E. STATISTICS

For  each  of  the  10  folds  of  a  cross  validation,  we  record  the  confusion  matrix  for 
that  fold.  From  that  confusion  matrix,  we  determine  the  overall  accuracy  and  error  rate 
for  that  fold.  Additionally,  the  confusion  matrix  is  also  used  to  determine  the  precision 
and  recall  of  each  class  during  that  fold.  At  the  end  of  the  10  folds  of  the  cross 
validation,  we  find  the  mean  and  the  standard  deviation  of  overall  accuracy  and  error. 
We  also  find  the  mean  and  standard  deviation  of  the  precision  and  recall  of  each  class. 

1. Confusion Matrix

The confusion matrix is a matrix that consists of rows that represent predicted classes and columns that represent actual classes from the testing phase of cross validation (as in Table 1).


                                  Actual
                                  Class of Interest      All Other Classes
Predicted    Class of Interest    True Positive (TP)     False Positive (FP)
             All Other Classes    False Negative (FN)    True Negative (TN)

Table 1. The Confusion Matrix.


During  the  testing  phase  of  cross  validation,  a  classification  algorithm  will  predict 
the  class  of  a  testing  datum.  Since  we  also  know  the  actual  class  of  the  testing  datum 
during  cross  validation,  then  we  can  increment  the  cell  in  the  confusion  matrix  that 
corresponds  to  the  class  the  classifier  predicted  for  the  testing  datum  (i.e.,  the  row)  and 
the  actual  class  of  the  testing  datum  (i.e.,  the  column). 

When determining true positives, false positives, false negatives, and true negatives, we must first select a class of interest. For instance, if we have the confusion matrix shown in Table 2 and we choose Class 1 as our class of interest, then the count of our True Positives (i.e., Predicted: Class 1 and Actual: Class 1) would be 10, the count of our False Positives (i.e., Predicted: Class 1 and Actual: All Other Classes) would be 1 + 2 = 3, the count of our False Negatives (i.e., Predicted: All Other Classes and Actual: Class 1) would be 3 + 5 = 8, and the count of our True Negatives (i.e., Predicted: All Other Classes and Actual: All Other Classes) would be 20 + 4 + 6 + 30 = 60.


                        Actual
                        Class 1    Class 2    Class 3
Predicted    Class 1       10          1          2
             Class 2        3         20          4
             Class 3        5          6         30

Table 2. An example three-class confusion matrix.
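The counts worked out above for Table 2 can be reproduced with a short sketch. The function name is hypothetical; the matrix is stored with rows as predicted classes and columns as actual classes, as in this section, and classes are indexed from zero (so Class 1 is index 0).

```python
def one_vs_rest_counts(cm, c):
    """Return (TP, FP, FN, TN) for class index c of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    tp = cm[c][c]
    fp = sum(cm[c]) - tp                   # predicted c, actually another class
    fn = sum(row[c] for row in cm) - tp    # actually c, predicted another class
    tn = total - tp - fp - fn
    return tp, fp, fn, tn

# Table 2: rows are predicted classes, columns are actual classes.
table2 = [[10, 1, 2],
          [3, 20, 4],
          [5, 6, 30]]
print(one_vs_rest_counts(table2, 0))  # (10, 3, 8, 60)
```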


2. Overall Accuracy

Overall accuracy represents how well a classifier performed during a fold of the cross validation procedure. From the confusion matrix, overall accuracy is computed by dividing the summation of the values on the main diagonal by the summation of every value in the matrix. In other words, the overall accuracy is the following:

$$\text{Overall Accuracy} = \frac{\sum_{i=1}^{c} CM_{i,i}}{\sum_{i=1}^{c} \sum_{j=1}^{c} CM_{i,j}}$$

where $CM_{i,j}$ represents the entry in the $i$-th row and $j$-th column of the confusion matrix and $c$ is the number of separate classes.

3. Overall Error Rate

Overall error rate represents how poorly a classifier performed during a fold of
the cross validation procedure. Overall error rate is the complement of overall accuracy
(i.e., one minus the overall accuracy); however, we can also compute the overall error
rate from the confusion matrix by dividing the summation of all values off of the main
diagonal by the summation of every value in the matrix. In other words, the overall
error rate is the following:

\[
\text{Overall Error Rate} = \frac{\sum_{i \neq j} CM_{i,j}}{\sum_{i=1}^{c} \sum_{j=1}^{c} CM_{i,j}}
\]
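As an illustration, the two ratios can be evaluated on the example matrix of Table 2. This is a minimal sketch; the function names are illustrative, and the matrix is assumed to be a square list of lists with predicted classes as rows.

```python
# Confusion matrix from Table 2: rows are predicted classes, columns are actual classes.
CM = [
    [10, 1, 2],
    [3, 20, 4],
    [5, 6, 30],
]

def overall_accuracy(cm):
    """Sum of the main diagonal divided by the sum of every entry."""
    total = sum(sum(row) for row in cm)
    diagonal = sum(cm[i][i] for i in range(len(cm)))
    return diagonal / total

def overall_error_rate(cm):
    """Sum of the off-diagonal entries divided by the sum of every entry."""
    c = len(cm)
    total = sum(sum(row) for row in cm)
    off_diagonal = sum(cm[i][j] for i in range(c) for j in range(c) if i != j)
    return off_diagonal / total

accuracy = overall_accuracy(CM)      # 60 / 81
error_rate = overall_error_rate(CM)  # 21 / 81
print(accuracy, error_rate)
```

The two values always sum to one, which is why the accuracy and error-rate columns of a results table carry the same standard deviation.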


4.  Precision 


To calculate precision, we must first fix a class of interest. Once a class of
interest is chosen, then precision can be calculated by the following:

\[
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
\]


5.  Recall 


To calculate recall, we must first fix a class of interest. Once a class of interest is
chosen, then recall can be calculated by the following:

\[
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
\]


A good classification algorithm produces high values for both precision and
recall. A poor classifier produces a high value for at most one of precision and recall,
but not for both.
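For concreteness, precision and recall for each class of the Table 2 example can be computed from the row and column sums. This is a minimal sketch with an illustrative function name; it assumes every class was predicted at least once and occurs at least once, so neither denominator is zero.

```python
# Confusion matrix from Table 2: rows are predicted classes, columns are actual classes.
CM = [
    [10, 1, 2],
    [3, 20, 4],
    [5, 6, 30],
]

def precision_recall(cm, k):
    """Return (precision, recall) for class index k (0-based)."""
    tp = cm[k][k]
    predicted_k = sum(cm[k])                # TP + FP: everything predicted as class k
    actual_k = sum(row[k] for row in cm)    # TP + FN: everything actually in class k
    return tp / predicted_k, tp / actual_k

for k in range(len(CM)):
    p, r = precision_recall(CM, k)
    print(f"Class {k + 1}: precision = {p:.4f}, recall = {r:.4f}")
```

For Class 1 this gives a precision of 10/13 and a recall of 10/18, matching the counts worked out for Table 2.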




V.  RESULTS 


The  results  for  classification  over  the  Wisconsin  Diagnostic  Breast  Cancer 
(WDBC)  dataset  and  the  ones,  twos,  and  threes  from  the  MNIST  Database  of 
Handwritten  Digits  using  various  classifiers  are  presented.  We  report  the  results  of  the 
two  best  parameterizations  for  each  type  of  classifier  (as  determined  by  the  overall 
accuracy). 

A. THE WISCONSIN DIAGNOSTIC BREAST CANCER (WDBC) DATASET


1.  Overall  Accuracy  and  Error  Rate 


Classifier | Overall Accuracy | Overall Error Rate
-----------|------------------|-------------------
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 8$ and $k_{\text{kernel density estimation}} = 100$ | 0.943759 ±0.0111921 | 0.0562408 ±0.0111921
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 10$ and $k_{\text{kernel density estimation}} = 106$ | 0.938525 ±0.0146528 | 0.0614748 ±0.0146528
k-Nearest Neighbor using Euclidean Distance with $k = 8$ | 0.938558 ±0.0188182 | 0.0614424 ±0.0188182
k-Nearest Neighbor using Euclidean Distance with $k = 10$ | 0.935048 ±0.0185173 | 0.0649523 ±0.0185173
Support Vector Machine using the Polynomial Kernel with degree = 3, $\gamma = 10^{[\text{illegible}]}$, $\nu = 0.1$, and coef0 = 240 | 0.962938 ±0.0348735 | 0.0370625 ±0.0348735
Support Vector Machine using the Polynomial Kernel with degree = 5, $\gamma = 10^{[\text{illegible}]}$, $\nu = 0.1$, and coef0 = 2.4 | 0.961275 ±0.0233969 | 0.038725 ±0.0233969
Random Forests with trees = 20, depth = 20, and split size = 11 | 0.966569 ±0.0154972 | 0.0334306 ±0.0154972
Random Forests with trees = 20, depth = 20, and split size = 6 | 0.966538 ±0.0255376 | 0.0334619 ±0.0255376

Table 3. Overall Accuracy and Error Rate for the Wisconsin Diagnostic Breast
Cancer (WDBC) Dataset
2. Class Precision and Recall


Classifier | Malignant Precision | Malignant Recall | Benign Precision | Benign Recall
-----------|---------------------|------------------|------------------|--------------
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 8$ and $k_{\text{kernel density estimation}} = 100$ | 0.948574 ±0.0397934 | 0.901082 ±0.0565543 | 0.9443 ±0.0301339 | 0.969127 ±0.0246911
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 10$ and $k_{\text{kernel density estimation}} = 106$ | 0.952244 ±0.037813 | 0.882468 ±0.0625009 | 0.934446 ±0.0321631 | 0.971905 ±0.0231181
k-Nearest Neighbor using Euclidean Distance with $k = 8$ | 0.935527 ±0.0488191 | 0.901082 ±0.0648552 | 0.944229 ±0.0353156 | 0.960794 ±0.0300639
k-Nearest Neighbor using Euclidean Distance with $k = 10$ | 0.934084 ±0.0381952 | 0.891775 ±0.0737203 | 0.939503 ±0.0398287 | 0.960714 ±0.0236421
Support Vector Machine using the Polynomial Kernel with degree = 3, $\gamma = 10^{[\text{illegible}]}$, $\nu = 0.1$, and coef0 = 240 | 0.953864 ±0.055712 | 0.947835 ±0.0524313 | 0.969643 ±0.0303464 | 0.971905 ±0.035075
Support Vector Machine using the Polynomial Kernel with degree = 5, $\gamma = 10^{[\text{illegible}]}$, $\nu = 0.1$, and coef0 = 2.4 | 0.954192 ±0.0468176 | 0.943506 ±0.0434155 | 0.967224 ±0.0248127 | 0.971905 ±0.0297879
Random Forests with trees = 20, depth = 20, and split size = 11 | 0.963834 ±0.0417267 | 0.948052 ±0.0345719 | 0.969932 ±0.0194968 | 0.977619 ±0.0257602
Random Forests with trees = 20, depth = 20, and split size = 6 | 0.959844 ±0.0521776 | 0.952814 ±0.0383116 | 0.972399 ±0.0219143 | 0.974762 ±0.0337253

Table 4. Precision and Recall for each class in the Wisconsin Diagnostic Breast
Cancer (WDBC) Dataset


3. Precision and Recall Curves

The following plots show precision versus recall for all 10-fold cross validation
runs of all classifiers. In these curves, the scale for the Precision and Recall axes ranges
from 0.60 to 1.00, rather than from 0.00 to 1.00, to emphasize the results. Note that some
individual runs of stratified 10-fold cross validation of Support Vector Machine and
Random Forests classification were perfect.
Figure 25. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
the Density Sensitive Distance Measurement for the Malignant Class in the
Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Figure 26. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
the Density Sensitive Distance Measurement for the Benign Class in the
Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Figure 27. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
Euclidean Distance for the Malignant Class in the Wisconsin Diagnostic Breast
Cancer (WDBC) Dataset

Figure 28. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
Euclidean Distance for the Benign Class in the Wisconsin Diagnostic Breast
Cancer (WDBC) Dataset

Figure 29. The Precision and Recall Curve for the Support Vector Machines for the
Malignant Class in the Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Figure 30. The Precision and Recall Curve for the Support Vector Machines for the
Benign Class in the Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Figure 31. The Precision and Recall Curve for the Random Forests for the Malignant
Class in the Wisconsin Diagnostic Breast Cancer (WDBC) Dataset

Figure 32. The Precision and Recall Curve for the Random Forests for the Benign Class
in the Wisconsin Diagnostic Breast Cancer (WDBC) Dataset
4. Discussion

Since the proposed density sensitive distance measurement is essentially a locally
weighted Euclidean distance, k-Nearest Neighbor classification using this density
sensitive distance measurement slightly outperforms the same classifier using Euclidean
distance. However, we note that the intervals (i.e., the means ± the standard deviations)
overlap.

As expected, the modern supervised learning algorithm, Random Forest,
dominates overall accuracy, overall error rate, and precision and recall for each class.
Although the intervals for k-Nearest Neighbor using the density sensitive distance
measurement and Random Forest do overlap, that overlap is quite slight.

B.  THE  ONES,  TWOS,  AND  THREES  FROM  THE  MNIST  DATABASE  OF 
HANDWRITTEN  DIGITS 


1. Overall Accuracy and Error Rate


Classifier | Overall Accuracy | Overall Error Rate
-----------|------------------|-------------------
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 1$ and $k_{\text{kernel density estimation}} = 150$ | 0.994105 ±0.00168777 | 0.00589454 ±0.00168777
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 1$ and $k_{\text{kernel density estimation}} = 100$ | 0.994052 ±0.00169415 | 0.00594762 ±0.00169415
k-Nearest Neighbor using Euclidean Distance with $k = 1$ | 0.993575 ±0.00179531 | 0.00642549 ±0.00179531
k-Nearest Neighbor using Euclidean Distance with $k = 3$ | 0.992353 ±0.00203715 | 0.00764703 ±0.00203715
Support Vector Machine using the RBF Kernel with $\gamma = 5.99484 \times 10^{[\text{illegible}]}$ and $\nu = 0.1$ | 0.98837 ±0.00364056 | 0.0116298 ±0.00364056
Support Vector Machine using the RBF Kernel with $\gamma = 10^{[\text{illegible}]}$ and $\nu = 0.1$ | 0.981785 ±0.00472434 | 0.0182148 ±0.00472434
Random Forests with trees = 20, depth = 20, and split size = 4 | 0.985343 ±0.00392884 | 0.0146569 ±0.00392884
Random Forests with trees = 20, depth = 20, and split size = 3 | 0.985025 ±0.00416513 | 0.0149753 ±0.00416513

Table 5. Overall Accuracy and Error Rate for the Ones, Twos, and Threes from the
MNIST Database of Handwritten Digits


2. Class Precision and Recall


Classifier | Ones Precision | Ones Recall | Twos Precision | Twos Recall | Threes Precision | Threes Recall
-----------|----------------|-------------|----------------|-------------|------------------|--------------
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 1$ and $k_{\text{kernel density estimation}} = 150$ | 0.995565 ±0.00294162 | 0.99733 ±0.00136235 | 0.991615 ±0.00304218 | 0.991777 ±0.00278905 | 0.994937 ±0.00161508 | 0.992824 ±0.00354039
k-Nearest Neighbor using the Density Sensitive Distance Measurement with $k_{\text{classification}} = 1$ and $k_{\text{kernel density estimation}} = 100$ | 0.995419 ±0.00297786 | 0.99733 ±0.00136235 | 0.991614 ±0.0030415 | 0.991609 ±0.00295768 | 0.994937 ±0.00161508 | 0.992824 ±0.00354039
k-Nearest Neighbor using Euclidean Distance with $k = 1$ | 0.993805 ±0.00330286 | 0.997924 ±0.00143185 | 0.992094 ±0.00326101 | 0.989091 ±0.00308645 | 0.994777 ±0.00251895 | 0.99315 ±0.00358793
k-Nearest Neighbor using Euclidean Distance with $k = 3$ | 0.989572 ±0.0035872 | 0.99822 ±0.00168445 | 0.992744 ±0.00285545 | 0.986407 ±0.00527022 | 0.995094 ±0.00297708 | 0.991682 ±0.00236131
Support Vector Machine using the RBF Kernel with $\gamma = 5.99484 \times 10^{[\text{illegible}]}$ and $\nu = 0.1$ | 0.997592 ±0.001611 | 0.982498 ±0.00702177 | 0.977642 ±0.00713341 | 0.995805 ±0.00276832 | 0.989069 ±0.00539673 | 0.987604 ±0.004156
Support Vector Machine using the RBF Kernel with $\gamma = 10^{[\text{illegible}]}$ and $\nu = 0.1$ | 0.987824 ±0.00496109 | 0.985168 ±0.00576701 | 0.974657 ±0.00679761 | 0.986239 ±0.00416655 | 0.982254 ±0.00795159 | 0.97374 ±0.00739521
Random Forests with trees = 20, depth = 20, and split size = 4 | 0.994931 ±0.00322289 | 0.988876 ±0.00485519 | 0.972341 ±0.00813912 | 0.989427 ±0.00371177 | 0.987798 ±0.00305495 | 0.97749 ±0.00752933
Random Forests with trees = 20, depth = 20, and split size = 3 | 0.994485 ±0.0043141 | 0.988283 ±0.00481462 | 0.972783 ±0.00615367 | 0.988756 ±0.00581317 | 0.986863 ±0.00733193 | 0.977817 ±0.00667079

Table 6. Precision and Recall for the Ones, Twos, and Threes from the MNIST
Database of Handwritten Digits
3. Precision and Recall Curves

The following plots show precision versus recall for all 10-fold cross-validation
runs of all classifiers. In these curves, the scale for the Precision and Recall axes ranges
from 0.90 to 1.00, rather than from 0.00 to 1.00, to emphasize the results.


Figure 33. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
the Density Sensitive Distance Measurement for the Ones Class of the MNIST
Database of Handwritten Digits

Figure 34. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
the Density Sensitive Distance Measurement for the Twos Class of the MNIST
Database of Handwritten Digits

Figure 35. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
the Density Sensitive Distance Measurement for the Threes Class of the MNIST
Database of Handwritten Digits

Figure 36. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
Euclidean Distance for the Ones Class of the MNIST Database of Handwritten
Digits

Figure 37. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
Euclidean Distance for the Twos Class of the MNIST Database of Handwritten
Digits

Figure 38. The Precision and Recall Curve for the k-Nearest Neighbor classifiers using
Euclidean Distance for the Threes Class of the MNIST Database of Handwritten
Digits

Figure 39. The Precision and Recall Curve for the Support Vector Machines for the Ones
Class of the MNIST Database of Handwritten Digits

Figure 40. The Precision and Recall Curve for the Support Vector Machines for the Twos
Class of the MNIST Database of Handwritten Digits

Figure 41. The Precision and Recall Curve for the Support Vector Machines for the
Threes Class of the MNIST Database of Handwritten Digits

Figure 42. The Precision and Recall Curve for the Random Forests for the Ones Class of
the MNIST Database of Handwritten Digits

Figure 43. The Precision and Recall Curve for the Random Forests for the Twos Class of
the MNIST Database of Handwritten Digits

Figure 44. The Precision and Recall Curve for the Random Forests for the Threes Class
of the MNIST Database of Handwritten Digits

4.  Discussion 

As  with  the  WDBC  dataset,  k  -Nearest  Neighbor  classification  using  the  density 
sensitive  distance  measurement  again  slightly  outperforms  the  same  classifier  using 
Euclidean  distance.  Similarly,  we  note  that  the  intervals  (i.e.,  the  means  ±  the  standard 
deviations)  overlap. 

However, the modern supervised learning classification algorithms, Support
Vector Machine and Random Forests, do not dominate the overall accuracy and overall
error rate. For both overall accuracy and error rate, the classifier using our density
sensitive distance measurement has superior performance. Moreover, the intervals of the
Support Vector Machines and Random Forests do not overlap with the interval of the
classifier using our density sensitive distance measurement.
VI. SUMMARY & CONCLUSIONS


A. SUMMARY

The proposed density sensitive distance measurement takes into account the
density of each dataset over which it is used. This density sensitive distance
measurement first finds the kernel density estimate of a given dataset and then takes the
line integral along the surface of that kernel density estimate as we travel linearly from an
initial position to a final position. The parameters that must be determined for this
density sensitive distance measure are the kernel bandwidth and the scale. In this work,
the kernel bandwidth is $\sigma$, the radius of the sphere each kernel approximates. Since we
arbitrarily desire smooth kernel density estimates, we take advantage of the additive
properties of the chosen kernels when their centers are within $2\sigma$ of each other. Hence,
we find the distance of the k-th nearest neighbor for each data point in a dataset and
form a value for $\sigma$ around these k-th nearest neighbor distances. The scale $\gamma$ allows
the variance in the kernel density estimate (i.e., the "vertical" variance) to be modified so
that it will not be overpowered by the variances in the x direction (i.e., "lateral"
variances).
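The pipeline summarized above can be sketched numerically. The following is only an illustration under stated assumptions, not the implementation used in this work: it assumes an unnormalized isotropic Gaussian kernel with a single bandwidth `sigma`, takes `sigma` to be the mean of the k-th nearest neighbor distances (one possible reading of "a value around" those distances), and interprets the line integral as the arc length of the scaled density surface above the straight segment between the two points. All function names are hypothetical.

```python
import math

def kth_neighbor_sigma(data, k):
    """Assumed bandwidth rule: mean distance from each point to its k-th nearest neighbor."""
    kth = []
    for i, p in enumerate(data):
        others = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        kth.append(others[k - 1])
    return sum(kth) / len(kth)

def kde(x, data, sigma):
    """Unnormalized Gaussian kernel density estimate at point x (assumed kernel form)."""
    return sum(math.exp(-0.5 * (math.dist(x, p) / sigma) ** 2) for p in data)

def density_sensitive_distance(a, b, data, sigma, gamma, steps=200):
    """Approximate arc length of the surface gamma * kde(x) above the segment from a to b."""
    lateral = math.dist(a, b) / steps          # horizontal length of each small step
    previous = gamma * kde(a, data, sigma)
    length = 0.0
    for s in range(1, steps + 1):
        t = s / steps
        x = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        current = gamma * kde(x, data, sigma)
        # Each piece is the hypotenuse of the lateral step and the vertical change.
        length += math.hypot(lateral, current - previous)
        previous = current
    return length

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
sigma = kth_neighbor_sigma(data, 1)
print(density_sensitive_distance((0.0, 0.0), (5.0, 5.0), data, sigma, gamma=10.0))
```

Under these assumptions the result is never smaller than the Euclidean distance, and it reduces to the Euclidean distance when `gamma` is zero, which is consistent with the description of the measurement as a locally weighted Euclidean distance. Note that every evaluation of `kde` scans the whole dataset, so one distance costs `steps + 1` passes over the data.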

From the definition of the proposed density sensitive distance measurement, we
utilized the density sensitive distance measurement in supervised learning in order to
determine its utility and performance. Using stratified 10-fold cross validation to
determine the generalization error, we trained and tested the k-Nearest Neighbor
classifier using the proposed measurement. We also compared that classifier with
k-Nearest Neighbor classification using Euclidean distance and two modern supervised
learning algorithms, Support Vector Machines and Random Forests. This comparison
took place over two datasets, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset
and a portion of the MNIST Database of Handwritten Digits.
B. CONCLUSIONS

The proposed density sensitive distance measure behaved as if it were a locally
weighted Euclidean distance. Where k-Nearest Neighbor classification using Euclidean
distance did well, k-Nearest Neighbor classification using our density sensitive distance
did slightly better. During classification, when proximity was a nominal factor compared to
density, as it was in the WDBC dataset, our density sensitive distance measurement
was nominally more successful than Euclidean distance, but subordinate to the modern
algorithms involved in Support Vector Machine and Random Forests classification.
When proximity was a larger factor in classification, as it was in the MNIST dataset,
our density sensitive distance measurement was superior to Support Vector Machines and
Random Forests (although still only nominally better than Euclidean distance). All
classifiers over both datasets did extremely well; therefore, future research using this
density sensitive distance measurement for classification should concentrate on more
difficult datasets.

The proposed density sensitive distance measurement conforms better to the
shape of the data than Euclidean distance and performs slightly better in k-Nearest
Neighbor classification; however, this density sensitive distance measurement comes at a
high computational cost. Since the line integral must use values from the kernel density
estimation, a single distance calculation must iterate over the entire training dataset
hundreds (if not thousands) of times. For small datasets, this can be negligible;
however, for medium-sized datasets and beyond, this may be too great a price to pay.
To mitigate this computational cost, there are many approximations that can be made to
substantially speed up the calculation of this density sensitive distance measurement.

C.  FUTURE  WORK 

The  proposed  density  sensitive  distance  measurement  was  used  in  classification 
on  easier  datasets;  hence,  all  classifiers  performed  extremely  well.  Since  all  the 
classifiers  had  exceptional  performance,  it  made  a  definitive  comparison  more 
challenging.  In  future  comparisons,  more  discriminating  datasets  should  be  used. 




Also, the current implementation of the proposed density sensitive distance
measurement can be optimized to only take into account training points that are
approximately near to any given testing point. Depending on the value of $\sigma$, many
points in the training dataset may contribute negligibly to the value of the kernel density
estimate at a given testing point. Work can be done to determine which training points
contribute and which do not, perhaps similarly to how KD Trees determine the relevancy
of points involved in a range or near neighbor query.
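The effect of discarding distant training points can be illustrated with a truncated kernel sum. This is a minimal sketch assuming an unnormalized Gaussian kernel and a hypothetical cutoff of three bandwidths; it uses a plain linear scan to select the nearby points, whereas a KD Tree range query would replace that scan in a serious implementation.

```python
import math

def kde_full(x, data, sigma):
    """Unnormalized Gaussian kernel density estimate using every training point."""
    return sum(math.exp(-0.5 * (math.dist(x, p) / sigma) ** 2) for p in data)

def kde_truncated(x, data, sigma, cutoff=3.0):
    """Same estimate, but ignoring points farther than cutoff * sigma from x."""
    radius = cutoff * sigma
    total = 0.0
    for p in data:
        d = math.dist(x, p)
        if d <= radius:  # a KD Tree range query would return exactly these points
            total += math.exp(-0.5 * (d / sigma) ** 2)
    return total

data = [(0.0,), (0.1,), (10.0,)]
x = (0.0,)
full = kde_full(x, data, sigma=1.0)
near_only = kde_truncated(x, data, sigma=1.0)
print(full, near_only)
```

With a cutoff of three bandwidths, each discarded point would have contributed at most exp(-4.5), roughly 0.011, to the unnormalized sum, so the cutoff trades a bounded error per point for fewer kernel evaluations.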

Moreover,  since  the  proposed  density  sensitive  distance  measurement  behaves  as 
a  locally  weighted  Euclidean  distance  and  since  Euclidean  distance  is  orders  of 
magnitude  faster,  a  light-weight  non-linear  regression  of  the  kernel  density  estimate  that 
applies  a  weight  to  Euclidean  distance  may  greatly  increase  the  speed  and  utility  of  this 
density  sensitive  distance  measurement  while  minimally  impacting  its  accuracy. 
VII. APPENDIX


A. THEOREM: THE NORMALIZED FIRST APPROXIMATION TO THE
TAYLOR SERIES EXPANSION OF THE UPPER-HALF OF THE n + 1
DIMENSIONAL ELLIPSOID CENTERED AT $(\mu_1, \ldots, \mu_n, 0)$ AND
ROTATED IN A HYPERPLANE RESTRICTED TO THE FIRST n
DIMENSIONS IS THE PROBABILITY DENSITY FUNCTION OF THE
MULTIVARIATE NORMAL DISTRIBUTION

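To prove this, we will first prove it for the $n + 1$ dimensional axis-aligned
ellipsoid and then extend it to any $n + 1$ dimensional ellipsoid that has been rotated in
an $n$ dimensional hyperplane.

1. The Upper-Half of the n + 1 Dimensional Axis-Aligned Ellipsoid

Let $x_1, \ldots, x_n, y$ be variables aligned to the $n + 1$ axes of $\mathbb{R}^{n+1}$. The equation for
an axis-aligned ellipsoid centered at $(\mu_1, \ldots, \mu_n, 0)$ where $\mu_j \in \mathbb{R}$ for $j = 1, \ldots, n$ with
radii $(\sigma_1, \ldots, \sigma_n, \sigma_{n+1})$ along each of those $n + 1$ axes where $\sigma_k \in \mathbb{R}$ such that $\sigma_k > 0$
for $k = 1, \ldots, n + 1$ is given by the following:

\[
\sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2} + \frac{y^2}{\sigma_{n+1}^2} = 1
\]

Solving for $y$, we have the following:

\[
y = \pm\, \sigma_{n+1} \sqrt{1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}}
\]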
If we restrict our focus to the upper-half of this axis-aligned ellipsoid (i.e., $y > 0$), then
we have the following:

\[
y_{\text{upper-half axis-aligned ellipsoid}} = \sigma_{n+1} \sqrt{1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}}
\]

Since $\lim_{u \to 0^{+}} \log(u) = -\infty$ (from real analysis) and $\lim_{u \to 0^{-}} \log(u) = -\infty$ (from complex
analysis) for $u \in \mathbb{R}$, then $\lim_{u \to 0} \log(u) = -\infty$. Furthermore, since $\lim_{u \to 0} \log(u) = -\infty$ and
$\lim_{v \to -\infty} \exp(v) = 0$ for $u, v \in \mathbb{R}$, then $u = \exp(\log(u))$ for $u > 0$. Since
$u = \exp(\log(u))$ for $u \in \mathbb{R}$, $u > 0$, then we have the following:
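\[
\begin{aligned}
y_{\text{upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \sqrt{1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}} \\
\exp\left(\log\left(y_{\text{upper-half axis-aligned ellipsoid}}\right)\right) &= \exp\left(\log\left(\sigma_{n+1} \sqrt{1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}}\right)\right) \\
y_{\text{upper-half axis-aligned ellipsoid}} &= \exp\left(\log(\sigma_{n+1}) + \log\left(\left(1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)^{1/2}\right)\right) \\
y_{\text{upper-half axis-aligned ellipsoid}} &= \exp\left(\log(\sigma_{n+1})\right) \exp\left(\frac{1}{2} \log\left(1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)\right) \\
y_{\text{upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \exp\left(\frac{1}{2} \log\left(1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)\right)
\end{aligned}
\]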

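Since the Taylor series expansion of $f(u)$ at $u_0$ is
$f(u) = \sum_{m=0}^{\infty} \frac{f^{(m)}(u_0)}{m!} (u - u_0)^m$, then for $f(u) = \log(u)$ at $u_0 = 1$, we have
$f^{(0)}(u) = \log(u)$, $f^{(m)}(u) = \frac{d^m}{du^m} \log(u) = \frac{(-1)^{m-1} (m-1)!}{u^m}$ for $m = 1, 2, 3, \ldots$, and
the following:

\[
\log(u) = \sum_{m=1}^{\infty} \frac{(-1)^{m-1}}{m} (u - 1)^m
\]

Substituting $u = 1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}$, we have the following:

\[
\log\left(1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)
= \sum_{m=1}^{\infty} \frac{(-1)^{m-1}}{m} \left(-\sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)^m
= -\sum_{m=1}^{\infty} \frac{1}{m} \left(\sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)^m
\]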
Hence, the first approximation to the Taylor series expansion of the upper-half of our
axis-aligned ellipsoid is the following:

\[
\begin{aligned}
y_{\text{upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \exp\left(-\frac{1}{2} \sum_{m=1}^{\infty} \frac{1}{m} \left(\sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)^m\right) \\
y_{\text{first approximation to upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \exp\left(-\frac{1}{2} \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)
\end{aligned}
\]

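We now parameterize $\sigma_{n+1}$ using $\sigma_j$ for $j = 1, \ldots, n$ such that the first
approximation to the Taylor series expansion of the upper-half of our axis-aligned
ellipsoid is normalized. In order for $y_{\text{first approximation to upper-half axis-aligned ellipsoid}}$ to be
normalized, the "area" under its "curve" needs to be one; in other words, we need the following:

\[
\begin{aligned}
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_{\text{first approximation to upper-half axis-aligned ellipsoid}} \, dx_1 \cdots dx_n &= 1 \\
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \sigma_{n+1} \exp\left(-\frac{1}{2} \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right) dx_1 \cdots dx_n &= 1
\end{aligned}
\]

For this to occur, we have the following:

\[
\begin{aligned}
\sigma_{n+1} \, (2\pi)^{n/2} \prod_{k=1}^{n} \sigma_k &= 1 \\
\sigma_{n+1} &= \frac{1}{(2\pi)^{n/2} \prod_{k=1}^{n} \sigma_k}
\end{aligned}
\]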

Therefore, the normalized first approximation to the Taylor series expansion of the
upper-half of our axis-aligned ellipsoid is the following:

\[
y_{\text{normalized first approximation to upper-half axis-aligned ellipsoid}}
= \frac{1}{(2\pi)^{n/2} \prod_{k=1}^{n} \sigma_k} \exp\left(-\frac{1}{2} \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right)
\]

Note that the normalized first approximation to the Taylor series expansion of the
upper-half of our axis-aligned ellipsoid is the probability density function of the
axis-aligned multivariate Normal distribution.
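The normalization can be sanity-checked numerically in one dimension, where it amounts to dividing the Gaussian shape by the square root of 2π times the single radius. The sketch below uses a simple trapezoidal rule over ten radii on either side of the center; the center and radius values are arbitrary choices made for illustration.

```python
import math

def normalized_first_approximation(x, mu, sigma):
    """One-dimensional case (n = 1): Gaussian shape scaled by 1 / (sqrt(2 * pi) * sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def trapezoid(f, lo, hi, steps):
    """Composite trapezoidal rule on [lo, hi] with equally spaced nodes."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

mu, sigma = 1.5, 0.7
area = trapezoid(lambda x: normalized_first_approximation(x, mu, sigma),
                 mu - 10.0 * sigma, mu + 10.0 * sigma, 20000)
print(area)  # very close to 1
```

The same check extends to n dimensions because the axis-aligned form factors into a product of one-dimensional integrals, one per radius.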


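Let $V$ (for variances) be the diagonal matrix created from the square of the first
$n$ radii such that we have the following:

\[
V = \begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \sigma_n^2
\end{bmatrix}
\]

then $V$ has the following properties:

1) $|V| = \prod_{k=1}^{n} \sigma_k^2$

2) $|V|^{1/2} = \left(\prod_{k=1}^{n} \sigma_k^2\right)^{1/2} = \prod_{k=1}^{n} \sigma_k$

3) $V^{-1} = \begin{bmatrix}
\frac{1}{\sigma_1^2} & 0 & \cdots & 0 \\
0 & \frac{1}{\sigma_2^2} & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \frac{1}{\sigma_n^2}
\end{bmatrix}$ where $\frac{1}{\sigma_k^2} > 0$ since $\sigma_k > 0$ for $k = 1, \ldots, n$.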

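With these properties in mind, we can re-write the normalized first approximation
to the Taylor series expansion of the upper-half of our axis-aligned ellipsoid as the
following:

\[
\begin{aligned}
y_{\text{normalized first approximation to upper-half axis-aligned ellipsoid}}
&= \frac{1}{(2\pi)^{n/2} \prod_{k=1}^{n} \sigma_k} \exp\left(-\frac{1}{2} \sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2}\right) \\
&= \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} V^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
\end{aligned}
\]

where $\mathbf{x} = (x_1, \ldots, x_n)^{T}$ and $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^{T}$; in other words, $\mathbf{x}$ and $\boldsymbol{\mu}$ are column
vectors. Moreover, since all the radii are axis aligned, then there is no covariance and
$V = \Sigma_{\text{axis-aligned}}$ where $\Sigma_{\text{axis-aligned}}$ is the axis-aligned covariance matrix (i.e., positive
non-zero values only on the diagonal); hence, we can re-write the normalized first
approximation to the Taylor series expansion of the upper-half of our axis-aligned
ellipsoid as the following:

\[
y_{\text{normalized first approximation to upper-half axis-aligned ellipsoid}}
= \frac{1}{(2\pi)^{n/2} \left|\Sigma_{\text{axis-aligned}}\right|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma_{\text{axis-aligned}}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
\]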

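2. The Upper-Half of the n + 1 Dimensional Rotated Ellipsoid

Since the equation for our $n + 1$ dimensional axis-aligned ellipsoid is the
following:

\[
\sum_{j=1}^{n} \frac{(x_j - \mu_j)^2}{\sigma_j^2} + \frac{y^2}{\sigma_{n+1}^2} = 1
\]

then the upper-half axis-aligned ellipsoid can be re-written as the following:

\[
y_{\text{upper-half axis-aligned ellipsoid}} = \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} V^{-1} (\mathbf{x} - \boldsymbol{\mu})}
\]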

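Let $R_{i,j}$ be the rotation matrix that rotates $\theta_{i,j}$ radians in the hyperplane spanned
by the $x_i$ and $x_j$ basis vectors where $x_i, x_j \in \mathbb{R}^n$, $i = 1, \ldots, n$, $j = 1, \ldots, n$, and $i \neq j$;
hence, we have the following:

\[
R_{i,j} = \begin{bmatrix}
1 & 0 & \cdots & & & & 0 \\
0 & \ddots & & & & & \vdots \\
\vdots & & \cos(\theta_{i,j}) & \cdots & \cos\left(\theta_{i,j} + \frac{\pi}{2}\right) & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & \sin(\theta_{i,j}) & \cdots & \sin\left(\theta_{i,j} + \frac{\pi}{2}\right) & & \vdots \\
\vdots & & & & & \ddots & 0 \\
0 & \cdots & & & & 0 & 1
\end{bmatrix}
\]

where the cosine entries lie in row $i$, the sine entries lie in row $j$, and the trigonometric
entries occupy columns $i$ and $j$.

As a rotation matrix, $R_{i,j}$ has the following properties:

1) $R_{i,j}$ is an orthogonal matrix

2) $R_{i,j}^{T} = R_{i,j}^{-1}$

3) $\left|R_{i,j}\right| = \pm 1$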

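Let $R$ represent all possible rotations in $\mathbb{R}^n$, then we have the following:

\[
R = \underbrace{R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}}_{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 1 \text{ rotation matrices}}
\]

Moreover, $R$ has the following properties:

1) Since the product of orthogonal matrices is an orthogonal matrix, then $R$ is an
orthogonal matrix

2) Since $R = R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}$, then we have the following:

\[
\begin{aligned}
R^{T} &= \left(R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}\right)^{T} \\
&= R_{n-1,n}^{T} \cdots R_{3,n}^{T} \cdots R_{3,4}^{T} R_{2,n}^{T} \cdots R_{2,3}^{T} R_{1,n}^{T} \cdots R_{1,2}^{T} \\
&= R_{n-1,n}^{-1} \cdots R_{3,n}^{-1} \cdots R_{3,4}^{-1} R_{2,n}^{-1} \cdots R_{2,3}^{-1} R_{1,n}^{-1} \cdots R_{1,2}^{-1} \\
&= \left(R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}\right)^{-1} \\
&= R^{-1}
\end{aligned}
\]

3) Since $R = R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}$, then we have the following:

\[
|R| = \left|R_{1,2}\right| \cdots \left|R_{1,n}\right| \left|R_{2,3}\right| \cdots \left|R_{n-1,n}\right| = \pm 1
\]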

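With these properties in mind, we can rotate our upper-half ellipsoid around its
center by multiplying $R_{n-1,n}^{-1} \cdots R_{1,2}^{-1} = R^{-1} = R^{T}$ by the column
vector $(\mathbf{x} - \boldsymbol{\mu})$; hence, we have the following:

\[
\begin{aligned}
y_{\text{upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} V^{-1} (\mathbf{x} - \boldsymbol{\mu})} \\
y_{\text{upper-half rotated ellipsoid}} &= \sigma_{n+1} \sqrt{1 - \left(R^{T} (\mathbf{x} - \boldsymbol{\mu})\right)^{T} V^{-1} \left(R^{T} (\mathbf{x} - \boldsymbol{\mu})\right)} \\
&= \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} \left(R^{T}\right)^{T} V^{-1} R^{T} (\mathbf{x} - \boldsymbol{\mu})} \\
&= \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} R \, V^{-1} R^{T} (\mathbf{x} - \boldsymbol{\mu})} \\
&= \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} \left(R^{T}\right)^{-1} V^{-1} R^{-1} (\mathbf{x} - \boldsymbol{\mu})} \\
&= \sigma_{n+1} \sqrt{1 - (\mathbf{x} - \boldsymbol{\mu})^{T} \left(R V R^{T}\right)^{-1} (\mathbf{x} - \boldsymbol{\mu})}
\end{aligned}
\]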
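The key matrix identity in this derivation, that $R V^{-1} R^{T}$ equals $(R V R^{T})^{-1}$ for an orthogonal $R$, can be checked numerically in two dimensions. The sketch below is illustrative only: it assumes a rotation of π/6 radians and squared radii of 4 and 9, and the small helper functions for 2 × 2 matrices are written out by hand so that nothing beyond the standard library is needed.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def inverse(a):
    """Inverse of a 2x2 matrix via the adjugate; assumes a nonzero determinant."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

theta = math.pi / 6.0
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
V = [[4.0, 0.0], [0.0, 9.0]]  # squared radii on the diagonal

lhs = matmul(matmul(R, inverse(V)), transpose(R))   # R V^{-1} R^T
rhs = inverse(matmul(matmul(R, V), transpose(R)))   # (R V R^T)^{-1}
print(lhs)
print(rhs)
```

Both products agree to floating-point precision, which is what allows the rotated quadratic form to be written with a single inverse of $R V R^{T}$.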
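Example: Let $x_1, x_2, y$ be variables aligned to the 3 axes of $\mathbb{R}^3$, then the equation for an
axis-aligned ellipsoid centered at $(-1, -2, 0)$ with radii $(2, 3, 1)$ along each of those 3
axes is the following:

\[
\frac{(x_1 - (-1))^2}{2^2} + \frac{(x_2 - (-2))^2}{3^2} + \frac{y^2}{1^2} = 1
\]

For the upper-half of this axis-aligned ellipsoid, we have the following:

\[
y_{\text{upper-half axis-aligned ellipsoid}} = \sqrt{1 - \frac{(x_1 - (-1))^2}{2^2} - \frac{(x_2 - (-2))^2}{3^2}}
\]

and

\[
y_{\text{upper-half axis-aligned ellipsoid}} = \sqrt{1 -
\begin{bmatrix} x_1 - (-1) & x_2 - (-2) \end{bmatrix}
\begin{bmatrix} \frac{1}{(2)^2} & 0 \\ 0 & \frac{1}{(3)^2} \end{bmatrix}
\begin{bmatrix} x_1 - (-1) \\ x_2 - (-2) \end{bmatrix}}
\]

which produce the following graph (perspective and top views, respectively):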
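If we rotate our upper-half ellipsoid around its center by $\theta_{1,2} = \frac{\pi}{6}$ radians in the $x_1$-$x_2$
hyperplane, then we have the following:

\[
R_{1,2}
= \begin{bmatrix} \cos(\theta_{1,2}) & \cos\left(\theta_{1,2} + \frac{\pi}{2}\right) \\ \sin(\theta_{1,2}) & \sin\left(\theta_{1,2} + \frac{\pi}{2}\right) \end{bmatrix}
= \begin{bmatrix} \cos(\theta_{1,2}) & -\sin(\theta_{1,2}) \\ \sin(\theta_{1,2}) & \cos(\theta_{1,2}) \end{bmatrix}
= \begin{bmatrix} \cos\left(\frac{\pi}{6}\right) & -\sin\left(\frac{\pi}{6}\right) \\ \sin\left(\frac{\pi}{6}\right) & \cos\left(\frac{\pi}{6}\right) \end{bmatrix}
= \begin{bmatrix} \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \frac{1}{2} & \frac{\sqrt{3}}{2} \end{bmatrix}
\]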


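Moreover, since $R = R_{1,2}$, the rotated upper-half ellipsoid is the following:

\[
y_{\text{upper-half rotated ellipsoid}} = \sqrt{1 -
\begin{bmatrix} x_1 - (-1) & x_2 - (-2) \end{bmatrix}
\begin{bmatrix} \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \frac{1}{2} & \frac{\sqrt{3}}{2} \end{bmatrix}
\begin{bmatrix} \frac{1}{(2)^2} & 0 \\ 0 & \frac{1}{(3)^2} \end{bmatrix}
\begin{bmatrix} \frac{\sqrt{3}}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{\sqrt{3}}{2} \end{bmatrix}
\begin{bmatrix} x_1 - (-1) \\ x_2 - (-2) \end{bmatrix}}
\]

which produces the following graph (perspective and top views, respectively):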

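Similarly, we can also rotate the first approximation to the Taylor series expansion of the upper-half of our axis-aligned ellipsoid; hence, we have the following:

\[
y_{\text{first approximation to upper-half axis-aligned ellipsoid}} = \sigma_{n+1} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} V^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]

\[
y_{\text{first approximation to upper-half rotated ellipsoid}} = \sigma_{n+1} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \left( R V R^{T} \right)^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]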
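The rotation step can be checked numerically. The following is a minimal sketch, not part of the original derivation, that assumes a 2-dimensional case with an illustrative angle, illustrative squared radii, and an arbitrary offset; the helper names (`transpose`, `matmul`, `inverse2`, `quad_form`) are hypothetical. It compares the quadratic form obtained by rotating the offset with $R^{T}$ against the one obtained by using $\left( R V R^{T} \right)^{-1}$ directly.

```python
import math

def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(d, m):
    """Scalar d^T M d for a length-2 vector d."""
    return sum(d[i] * m[i][j] * d[j] for i in range(2) for j in range(2))

theta = math.pi / 6                      # illustrative angle (assumption)
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
V = [[2.0 ** 2, 0.0], [0.0, 3.0 ** 2]]   # squared radii on the diagonal
d = [0.3, -0.7]                          # an arbitrary offset x - mu

# Route 1: rotate the offset with R^T, then use the axis-aligned V^{-1}.
Rt_d = [sum(R[k][i] * d[k] for k in range(2)) for i in range(2)]
q_vector_route = quad_form(Rt_d, inverse2(V))

# Route 2: keep the offset and use (R V R^T)^{-1} directly.
RVRt = matmul(matmul(R, V), transpose(R))
q_matrix_route = quad_form(d, inverse2(RVRt))

# First approximation of the rotated upper-half ellipsoid with sigma_{n+1} = 1.
first_approximation = math.exp(-0.5 * q_matrix_route)
print(q_vector_route, q_matrix_route, first_approximation)
```

Both routes should agree up to floating-point rounding, which is the identity $\left( R^{T}\mathbf{d} \right)^{T} V^{-1} \left( R^{T}\mathbf{d} \right) = \mathbf{d}^{T} \left( R V R^{T} \right)^{-1} \mathbf{d}$ used above.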
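Moreover, we can also rotate the normalized first approximation to the Taylor series expansion of the upper-half of our axis-aligned ellipsoid; hence, we also have the following:

\[
y_{\text{normalized first approximation to upper-half axis-aligned ellipsoid}} = \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} V^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]

\[
y_{\text{normalized first approximation to upper-half rotated ellipsoid}} = \frac{1}{(2\pi)^{n/2} \left| R V R^{T} \right|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \left( R V R^{T} \right)^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]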


The full covariance matrix $S$ (i.e., the covariance matrix that is not necessarily axis-aligned) can be recovered from the following singular value decomposition:

\[
B V_{\text{ordered}} B^{T} = S
\]

where $B$ (for basis) is an orthogonal matrix such that $|B| = \pm 1$ and $V_{\text{ordered}}$ is $V$ with the squared radii (i.e., the variances) ordered along the diagonal. Thus, we need to find $B$ to recover $S$.


In order to find $B$, we now turn to row and column swapping elementary matrices. Let $E_{\text{row swap},w}$ and $E_{\text{column swap},w}$ denote the $w$-th row and column swapping elementary matrix, respectively. Row and column swapping elementary matrices can be used to order values along the diagonal of a matrix. Since we need to order $V$, we will use elementary matrices to accomplish that task; hence, we have the following:

\[
V_{\text{ordered}} = E_{\text{row swap},q} \cdots E_{\text{row swap},1}\; V\; E_{\text{column swap},1} \cdots E_{\text{column swap},q}
\]

where $E_{\text{row swap},w} = E_{\text{column swap},w}$ for $w = 1, \ldots, q$.
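Example: To order the squared radii along the diagonal of the following matrix:

\[
\begin{pmatrix}
(4)^{2} & 0 & 0 & 0 \\
0 & (2)^{2} & 0 & 0 \\
0 & 0 & (1)^{2} & 0 \\
0 & 0 & 0 & (3)^{2}
\end{pmatrix}
\]

We use the following elementary matrices:

\[
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}
\]

such that we have the following:

\[
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}}_{\text{swap row 3 and row 4}}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}}_{\text{swap row 2 and row 4}}
\begin{pmatrix}
(4)^{2} & 0 & 0 & 0 \\
0 & (2)^{2} & 0 & 0 \\
0 & 0 & (1)^{2} & 0 \\
0 & 0 & 0 & (3)^{2}
\end{pmatrix}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}}_{\text{swap column 2 and column 4}}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}}_{\text{swap column 3 and column 4}}
\]

\[
=
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}}_{\text{swap row 3 and row 4}}
\begin{pmatrix}
(4)^{2} & 0 & 0 & 0 \\
0 & (3)^{2} & 0 & 0 \\
0 & 0 & (1)^{2} & 0 \\
0 & 0 & 0 & (2)^{2}
\end{pmatrix}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}}_{\text{swap column 3 and column 4}}
=
\begin{pmatrix}
(4)^{2} & 0 & 0 & 0 \\
0 & (3)^{2} & 0 & 0 \\
0 & 0 & (2)^{2} & 0 \\
0 & 0 & 0 & (1)^{2}
\end{pmatrix}
\]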
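The ordering in this example can be reproduced mechanically. The sketch below is an illustration under stated assumptions rather than the author's procedure: it uses a selection-sort style search (an assumption) to choose which pair of rows and columns to swap, and the names `swap_matrix`, `matmul`, and `order_diagonal` are hypothetical. Indices are 0-based, so the pair `(1, 3)` corresponds to swapping row 2 and row 4.

```python
def swap_matrix(n, i, j):
    """n x n elementary matrix that swaps rows (or columns) i and j."""
    e = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    e[i], e[j] = e[j], e[i]
    return e

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def order_diagonal(v):
    """Order the diagonal of v from largest to smallest using E V E steps."""
    n = len(v)
    current = [row[:] for row in v]
    swaps = []
    for pos in range(n):
        # Index of the largest remaining diagonal entry.
        largest = max(range(pos, n), key=lambda k: current[k][k])
        if largest != pos:
            e = swap_matrix(n, pos, largest)
            # The same elementary matrix swaps rows (left) and columns (right).
            current = matmul(matmul(e, current), e)
            swaps.append((pos, largest))
    return swaps, current

V = [[4 ** 2, 0, 0, 0],
     [0, 2 ** 2, 0, 0],
     [0, 0, 1 ** 2, 0],
     [0, 0, 0, 3 ** 2]]
swaps, V_ordered = order_diagonal(V)
print(swaps)
print([V_ordered[k][k] for k in range(4)])
```

Applying the same swap matrix on both sides keeps the matrix diagonal, which is why a single label per swap suffices.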


Row swapping and column swapping elementary matrices also have the following properties:

1) $E_{\text{row swap}}$ and $E_{\text{column swap}}$ are orthogonal matrices with components of value 0 or 1 only

2) $E_{\text{row swap}} = E_{\text{row swap}}^{T} = E_{\text{row swap}}^{-1}$ and $E_{\text{column swap}} = E_{\text{column swap}}^{T} = E_{\text{column swap}}^{-1}$

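Since $V_{\text{ordered}} = E_{\text{row swap},q} \cdots E_{\text{row swap},1}\, V\, E_{\text{column swap},1} \cdots E_{\text{column swap},q}$, $E_{\text{row swap},w} = E_{\text{row swap},w}^{-1}$, and $E_{\text{column swap},w} = E_{\text{column swap},w}^{-1}$, then we have the following:

\[
\begin{aligned}
E_{\text{row swap},q} \cdots E_{\text{row swap},1}\, V\, E_{\text{column swap},1} \cdots E_{\text{column swap},q} &= V_{\text{ordered}} \\
E_{\text{row swap},1}^{-1} \cdots E_{\text{row swap},q}^{-1}\, E_{\text{row swap},q} \cdots E_{\text{row swap},1}\, V\, E_{\text{column swap},1} \cdots E_{\text{column swap},q} &= E_{\text{row swap},1}^{-1} \cdots E_{\text{row swap},q}^{-1}\, V_{\text{ordered}} \\
V\, E_{\text{column swap},1} \cdots E_{\text{column swap},q}\, E_{\text{column swap},q}^{-1} \cdots E_{\text{column swap},1}^{-1} &= E_{\text{row swap},1}^{-1} \cdots E_{\text{row swap},q}^{-1}\, V_{\text{ordered}}\, E_{\text{column swap},q}^{-1} \cdots E_{\text{column swap},1}^{-1} \\
V &= E_{\text{row swap},1}^{-1} \cdots E_{\text{row swap},q}^{-1}\, V_{\text{ordered}}\, E_{\text{column swap},q}^{-1} \cdots E_{\text{column swap},1}^{-1} \\
V &= E_{\text{row swap},1} \cdots E_{\text{row swap},q}\, V_{\text{ordered}}\, E_{\text{column swap},q} \cdots E_{\text{column swap},1}
\end{aligned}
\]

Moreover, since $S_{\text{axis-aligned}} = V$, then we have the following:

\[
S_{\text{axis-aligned}} = E_{\text{row swap},1} \cdots E_{\text{row swap},q}\, V_{\text{ordered}}\, E_{\text{column swap},q} \cdots E_{\text{column swap},1}
\]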
Since we are reordering the squared radii (the variances) along the diagonal of $V$, then $E_{\text{row swap},w} = E_{\text{column swap},w}$ for $w = 1, \ldots, q$, so we will drop the "row swap" and "column swap" labels and just write $E_{w}$ such that we have the following:

\[
S_{\text{axis-aligned}} = E_{1} \cdots E_{q}\, V_{\text{ordered}}\, E_{q} \cdots E_{1}
\]

Moreover, since $E_{w} = E_{w}^{T}$, then we have the following:

\[
\begin{aligned}
S_{\text{axis-aligned}} &= E_{1} \cdots E_{q}\, V_{\text{ordered}}\, E_{q} \cdots E_{1} \\
&= \left( E_{1} \cdots E_{q} \right) V_{\text{ordered}}\, E_{q}^{T} \cdots E_{1}^{T} \\
&= \left( E_{1} \cdots E_{q} \right) V_{\text{ordered}} \left( E_{1} \cdots E_{q} \right)^{T}
\end{aligned}
\]

If we let $\left( E_{1} \cdots E_{q} \right) = B_{\text{axis-aligned}}$, then $B_{\text{axis-aligned}}$ is an orthogonal matrix with $\left| B_{\text{axis-aligned}} \right| = \pm 1$ and we have a singular value decomposition

\[
S_{\text{axis-aligned}} = B_{\text{axis-aligned}}\, V_{\text{ordered}}\, B_{\text{axis-aligned}}^{T}
\]

where the diagonal entries of $V_{\text{ordered}}$ are the eigenvalues (i.e., the squared radii or variances) and the columns of $B_{\text{axis-aligned}}$ are the corresponding eigenvectors (i.e., the vectors along which the radii are aligned).

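Since we can also "rotate" $V$ using $R V R^{T}$ where $R$ is as previously defined, then we also have the following:

\[
\begin{aligned}
R V R^{T} &= R\, S_{\text{axis-aligned}}\, R^{T} \\
&= R \left( E_{1} \cdots E_{q} \right) V_{\text{ordered}} \left( E_{1} \cdots E_{q} \right)^{T} R^{T} \\
&= \left( R \left( E_{1} \cdots E_{q} \right) \right) V_{\text{ordered}} \left( R \left( E_{1} \cdots E_{q} \right) \right)^{T}
\end{aligned}
\]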
Since the singular value decomposition of $S$ is $B V_{\text{ordered}} B^{T}$, where $B$ is not necessarily axis-aligned, then we have that $B = R \left( E_{1} \cdots E_{q} \right)$. Moreover, since $R = R_{1,2} \cdots R_{1,n} R_{2,3} \cdots R_{2,n} R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n}$, we have the following:

\[
B = R_{1,2} \cdots R_{1,n}\, R_{2,3} \cdots R_{2,n}\, R_{3,4} \cdots R_{3,n} \cdots R_{n-1,n} \left( E_{1} \cdots E_{q} \right)
\]

Thus, the eigenvectors associated with the covariance matrix $S$ are a result of the "rotation" and ordering of the eigenvalues (i.e., the ordered squared radii or the variances). Hence, the covariance matrix $S = B V_{\text{ordered}} B^{T} = R V R^{T}$.
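As a numerical illustration of this identity, the sketch below uses an assumed 2-dimensional case (angle $\pi/6$, squared radii 4 and 9, one swap); these values and the helper names are illustrative assumptions. It checks that $B = R E$ reproduces $S$ from the ordered variances and that each column of $B$ is an eigenvector of $S$.

```python
import math

def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

theta = math.pi / 6                        # illustrative angle (assumption)
cos_t, sin_t = math.cos(theta), math.sin(theta)
R = [[cos_t, -sin_t], [sin_t, cos_t]]
V = [[4.0, 0.0], [0.0, 9.0]]               # unordered squared radii
E = [[0.0, 1.0], [1.0, 0.0]]               # the single swap that orders V

V_ordered = matmul(matmul(E, V), E)        # diag(9, 4)
B = matmul(R, E)                           # B = R (E_1 ... E_q) with q = 1

S_from_R = matmul(matmul(R, V), transpose(R))
S_from_B = matmul(matmul(B, V_ordered), transpose(B))
max_gap = max(abs(S_from_R[i][j] - S_from_B[i][j])
              for i in range(2) for j in range(2))

# Each column b_k of B should satisfy S b_k = lambda_k b_k.
residual = 0.0
for k in range(2):
    column = [B[0][k], B[1][k]]
    eigenvalue = V_ordered[k][k]
    s_column = [sum(S_from_R[i][j] * column[j] for j in range(2))
                for i in range(2)]
    residual = max(residual,
                   max(abs(s_column[i] - eigenvalue * column[i])
                       for i in range(2)))
print(max_gap, residual)
```

Both printed values should be zero up to rounding error.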

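Therefore, the $n + 1$ dimensional upper-half ellipsoid that is rotated in an $n$-dimensional hyperplane about its center is the following:

\[
y_{\text{upper-half rotated ellipsoid}} = \sigma_{n+1} \left( 1 - (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{1/2}
\]

The first approximation to the Taylor series expansion of the upper-half of our rotated ellipsoid above is the following:

\[
y_{\text{first approximation to upper-half rotated ellipsoid}} = \sigma_{n+1} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]

The normalized first approximation to the Taylor series expansion of the upper-half of our rotated ellipsoid above is the following:

\[
\begin{aligned}
y_{\text{normalized first approximation to upper-half rotated ellipsoid}}
&= \frac{1}{(2\pi)^{n/2} \left| R V R^{T} \right|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \left( R V R^{T} \right)^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\end{aligned}
\]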

Therefore, the normalized first approximation to the Taylor series expansion of the upper-half of the $n + 1$ dimensional ellipsoid centered at $(\mu_1, \ldots, \mu_n, 0)$ and rotated in a hyperplane restricted to the first $n$ dimensions is the probability density function of the multivariate Normal distribution.


B. COROLLARY: THE PROBABILITY DENSITY FUNCTION OF THE MULTIVARIATE NORMAL DISTRIBUTION CENTERED AT $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)$ WITH NON-SINGULAR COVARIANCE $S$ IS BOUNDED BELOW BY THE SECOND OR HIGHER APPROXIMATION TO THE TAYLOR SERIES EXPANSION OF THE UPPER-HALF OF THE $n + 1$ DIMENSIONAL ELLIPSOID WITH IDENTICAL COVARIANCE $S$ CENTERED AT $(\mu_1, \ldots, \mu_n, 0)$ AND MULTIPLIED BY THE SCALAR $1 / \left( (2\pi)^{n/2} |S|^{1/2} \right)$.


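The probability density function of the multivariate Normal distribution is the following:

\[
y_{\text{pdf of the multivariate normal}} = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)
\]

The maximum value of this function occurs when $\mathbf{x} = \boldsymbol{\mu}$. When $\mathbf{x} = \boldsymbol{\mu}$, then $\mathbf{x} - \boldsymbol{\mu} = \mathbf{0}$ and we have the following:

\[
\begin{aligned}
y_{\text{pdf of the multivariate normal}} &= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) \\
y_{\text{pdf of the multivariate normal with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}} &= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2}\, \mathbf{0}^{T} S^{-1} \mathbf{0} \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp(0) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}}
\end{aligned}
\]

Hence, the maximum value of the probability density function of the multivariate Normal distribution is $1 / \left( (2\pi)^{n/2} |S|^{1/2} \right)$.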


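Let $\sigma_{n+1} = 1 / \left( (2\pi)^{n/2} |S|^{1/2} \right)$ where $S$ is the identical covariance of the probability density function of the multivariate Normal distribution above. Then we have the following scaled Taylor series expansion of the $n + 1$ dimensional rotated ellipsoid:

\[
\begin{aligned}
y_{\text{upper-half axis-aligned ellipsoid}} &= \sigma_{n+1} \left( 1 - \sum_{j=1}^{n} \frac{(x_j - \mu_j)^{2}}{\sigma_j^{2}} \right)^{1/2}
= \sigma_{n+1} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} V^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
y_{\text{upper-half rotated ellipsoid}} &= \sigma_{n+1} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
y_{\text{scaled upper-half rotated ellipsoid}} &= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right)
\end{aligned}
\]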

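Similarly, the maximum value of this function occurs when $\mathbf{x} = \boldsymbol{\mu}$. When $\mathbf{x} = \boldsymbol{\mu}$, then $\mathbf{x} - \boldsymbol{\mu} = \mathbf{0}$ and we also have the following:

\[
\begin{aligned}
y_{\text{scaled upper-half rotated ellipsoid}} &= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
y_{\text{scaled upper-half rotated ellipsoid with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}} &= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( \mathbf{0}^{T} S^{-1} \mathbf{0} \right)^{i}}{2i} \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp(0) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}}
\end{aligned}
\]

Hence, at $\mathbf{x} = \boldsymbol{\mu}$, we have

\[
y_{\text{pdf of the multivariate normal with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}} = y_{\text{scaled upper-half rotated ellipsoid with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}}
\]

which implies that we also have

\[
y_{\text{pdf of the multivariate normal with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}} \geq y_{\text{scaled upper-half rotated ellipsoid with } \mathbf{x}-\boldsymbol{\mu}=\mathbf{0}}
\]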


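Since $S$ is non-singular, then $S^{-1}$ exists and, from the details of the previous theorem, $S = R V R^{T}$ where $R$ is a composite of rotation matrices and $V$ is a diagonal matrix with entries corresponding to the squared radii of the axis-aligned ellipsoid (i.e., the radii of the ellipsoid prior to its current rotated state). Since all these squared radii are positive, then $|V| > 0$. Since $|XY| = |X||Y|$, $|V| > 0$, and $|R| = \pm 1$, then we have the following:

\[
|S| = \left| R V R^{T} \right| = |R|\,|V|\,\left| R^{T} \right| = |R|\,|V|\,|R| = |V|\,|R|^{2} = |V| (\pm 1)^{2} = |V| > 0
\]

Since $|S| > 0$ and $(\mathbf{x}-\boldsymbol{\mu})^{T} (\mathbf{x}-\boldsymbol{\mu}) \geq 0$, then $(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \geq 0$; more precisely, $S^{-1} = R V^{-1} R^{T}$ and every diagonal entry of $V^{-1}$ is positive, so $(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) = \left( R^{T}(\mathbf{x}-\boldsymbol{\mu}) \right)^{T} V^{-1} \left( R^{T}(\mathbf{x}-\boldsymbol{\mu}) \right)$ is a sum of non-negative terms. Since $(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \geq 0$, then

\[
\frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \geq 0 \quad \text{for all } i = 1, 2, 3, \ldots;
\]

hence,

\[
-\frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \leq 0 \quad \text{for all } i = 1, 2, 3, \ldots.
\]

Furthermore, since each of these terms is a real number that is at most $0$ for all $i = 1, 2, 3, \ldots$ and $\exp(u) \leq 1$ for $u \leq 0$, then we have the following: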
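\[
\begin{aligned}
y_{\text{scaled upper-half rotated ellipsoid}}
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} - \sum_{i=2}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} \right) \exp\left( -\sum_{i=2}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
&\leq \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} \right)
\end{aligned}
\]

Finally, since

\[
y_{\text{scaled upper-half rotated ellipsoid}} \leq \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} \right)
\]

and

\[
y_{\text{pdf of the multivariate normal}} = \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right),
\]

then we have the following:

\[
\begin{aligned}
y_{\text{scaled upper-half rotated ellipsoid}}
&\leq \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{(\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu})}{2} \right) \\
&= \frac{1}{(2\pi)^{n/2} |S|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) \\
&= y_{\text{pdf of the multivariate normal}}
\end{aligned}
\]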
Therefore, since $y_{\text{scaled upper-half rotated ellipsoid}} \leq y_{\text{pdf of the multivariate normal}}$, then the probability density function of the multivariate Normal distribution centered at $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)$ with non-singular covariance $S$ is bounded below by the second or higher approximation to the Taylor series expansion of the upper-half of the $n + 1$ dimensional ellipsoid with identical covariance $S$ centered at $(\mu_1, \ldots, \mu_n, 0)$ and multiplied by the scalar $1 / \left( (2\pi)^{n/2} |S|^{1/2} \right)$.


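Example: Let $x, y$ be variables aligned to the 2 axes of $\mathbb{R}^{2}$; then the equation for the probability density function of the multivariate Normal with $\mu = 0$ and $\sigma^{2} = 1$ is the following:

\[
y_{\text{pdf of this normal}} = \frac{1}{(2\pi)^{1/2} |1|^{1/2}} \exp\left( -\frac{1}{2} (x - 0)^{T} (1)^{-1} (x - 0) \right) = \frac{1}{(2\pi)^{1/2}} \exp\left( -\frac{1}{2} x^{2} \right)
\]

The scaled Taylor series expansion of the upper-half of the 2 dimensional ellipsoid is the following:

\[
y_{\text{scaled upper-half ellipsoid}} = \frac{1}{(2\pi)^{1/2} |1|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (x - 0)^{T} (1)^{-1} (x - 0) \right)^{i}}{2i} \right) = \frac{1}{(2\pi)^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{x^{2i}}{2i} \right)
\]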

If we plot the probability density function of this Normal on the same graph as the scaled second, third, fourth, and fifth approximations of the Taylor series expansion of the upper-half of the 2-dimensional ellipsoid, then we have the following:

[Figure: four plots of y against x, one per approximation]

where the red curve is the probability density function of this Normal and the blue curve is the second, third, fourth, and fifth approximations of the Taylor series expansion of the upper-half of the 2-dimensional ellipsoid, respectively.

Note that as the order of the approximation approaches infinity, we will have the following:

[Figure: plot of y against x for x from -3 to 3]
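The bound for this 1-dimensional example can also be checked numerically. The sketch below is an illustration, not the code used to produce the plots; the function names and the grid of sample points are assumptions. It evaluates the scaled approximations with 2 to 5 terms and compares them to the standard Normal density, then checks that a long truncation matches the scaled half-ellipse $\sqrt{1 - x^{2}} / \sqrt{2\pi}$ inside $|x| < 1$.

```python
import math

def normal_pdf(x):
    """Standard Normal density (mu = 0, sigma^2 = 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def scaled_approximation(x, terms):
    """Scaled expansion truncated after `terms` terms of the exponent series."""
    q = x * x
    series = sum(q ** i / (2.0 * i) for i in range(1, terms + 1))
    return math.exp(-series) / math.sqrt(2.0 * math.pi)

# Sample points on [-3, 3], the range suggested by the plot axis.
xs = [k / 10.0 for k in range(-30, 31)]

# Second to fifth approximations never exceed the Normal density.
bounded = all(scaled_approximation(x, m) <= normal_pdf(x)
              for x in xs for m in range(2, 6))

# With one term the approximation is the Normal density itself.
first_gap = max(abs(scaled_approximation(x, 1) - normal_pdf(x)) for x in xs)

# Inside |x| < 1 a long truncation approaches the scaled half-ellipse.
x0 = 0.5
limit_gap = abs(scaled_approximation(x0, 200)
                - math.sqrt(1.0 - x0 * x0) / math.sqrt(2.0 * math.pi))
print(bounded, first_gap, limit_gap)
```

The exponent series only converges for $x^{2} < 1$, so the comparison with the half-ellipse is restricted to that interval, while each finite truncation is defined for every $x$.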

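Example: Let $x_1, x_2, y$ be variables aligned to the 3 axes of $\mathbb{R}^{3}$; then the equation for the probability density function of the multivariate Normal with mean $\boldsymbol{\mu} = (1, 2)$, variances $\sigma_1^{2} = (1)^{2} = 1$ and $\sigma_2^{2} = (3)^{2} = 9$, and rotation $\theta = \frac{\pi}{3}$ is the following:

\[
\begin{aligned}
S &=
\begin{pmatrix} \cos(\theta) & \cos\left(\theta + \frac{\pi}{2}\right) \\ \sin(\theta) & \sin\left(\theta + \frac{\pi}{2}\right) \end{pmatrix}
\begin{pmatrix} \sigma_1^{2} & 0 \\ 0 & \sigma_2^{2} \end{pmatrix}
\begin{pmatrix} \cos(\theta) & \cos\left(\theta + \frac{\pi}{2}\right) \\ \sin(\theta) & \sin\left(\theta + \frac{\pi}{2}\right) \end{pmatrix}^{T} \\
&=
\begin{pmatrix} \cos\frac{\pi}{3} & -\sin\frac{\pi}{3} \\ \sin\frac{\pi}{3} & \cos\frac{\pi}{3} \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 9 \end{pmatrix}
\begin{pmatrix} \cos\frac{\pi}{3} & -\sin\frac{\pi}{3} \\ \sin\frac{\pi}{3} & \cos\frac{\pi}{3} \end{pmatrix}^{T}
=
\begin{pmatrix} 7 & -2\sqrt{3} \\ -2\sqrt{3} & 3 \end{pmatrix}
\end{aligned}
\]

\[
y_{\text{pdf of this normal}} = \frac{1}{(2\pi)^{2/2} |S|^{1/2}} \exp\left( -\frac{1}{2}
\begin{pmatrix} x_1 - 1 \\ x_2 - 2 \end{pmatrix}^{T}
\begin{pmatrix} 7 & -2\sqrt{3} \\ -2\sqrt{3} & 3 \end{pmatrix}^{-1}
\begin{pmatrix} x_1 - 1 \\ x_2 - 2 \end{pmatrix}
\right)
\]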
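The covariance in this example can be recomputed as a check. The sketch below assumes the rotation angle is $\pi/3$, which is the value consistent with the covariance entries $7$, $-2\sqrt{3}$, and $3$ shown in the example; the variable and function names are hypothetical.

```python
import math

theta = math.pi / 3                 # assumed rotation angle
c, s = math.cos(theta), math.sin(theta)
R = [[c, -s], [s, c]]
variances = [1.0, 9.0]              # sigma_1^2 and sigma_2^2
mu = [1.0, 2.0]

# S = R V R^T written out entry by entry for a diagonal V.
S = [[sum(R[i][k] * variances[k] * R[j][k] for k in range(2))
      for j in range(2)] for i in range(2)]

det_S = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det_S, -S[0][1] / det_S],
         [-S[1][0] / det_S, S[0][0] / det_S]]

def normal_pdf(x1, x2):
    """Density of the bivariate Normal with mean mu and covariance S."""
    d = [x1 - mu[0], x2 - mu[1]]
    q = sum(d[i] * S_inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det_S))

print(S, det_S, normal_pdf(mu[0], mu[1]))
```

Under this assumption the determinant is $9$, so the scalar in front of the exponential is $1 / (2\pi \cdot 3) = 1/(6\pi)$, which is also the maximum of the density at the mean.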
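The scaled Taylor series expansion of the upper-half of the 3-dimensional ellipsoid is the following:

\[
\begin{aligned}
y_{\text{scaled upper-half rotated ellipsoid}}
&= \frac{1}{(2\pi)^{2/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{\left( (\mathbf{x}-\boldsymbol{\mu})^{T} S^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)^{i}}{2i} \right) \\
&= \frac{1}{(2\pi)^{2/2} |S|^{1/2}} \exp\left( -\sum_{i=1}^{\infty} \frac{1}{2i} \left(
\begin{pmatrix} x_1 - 1 \\ x_2 - 2 \end{pmatrix}^{T}
\begin{pmatrix} 7 & -2\sqrt{3} \\ -2\sqrt{3} & 3 \end{pmatrix}^{-1}
\begin{pmatrix} x_1 - 1 \\ x_2 - 2 \end{pmatrix}
\right)^{i} \right)
\end{aligned}
\]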

If we plot the probability density function of this Normal on the same graph as the scaled second, third, and fourth approximations of the Taylor series expansion of the upper-half of the 3-dimensional ellipsoid, then we have the following (two perspective plots per approximation with the second to fourth approximation displayed from top to bottom):

where the outer red curve is the probability density function of this Normal and the inner blue curve is the second, third, and fourth approximations of the Taylor series expansion of the upper-half of the 3-dimensional ellipsoid, respectively from top to bottom.

Note that as the order of the approximation approaches infinity, we will have the following:


where the outer red curve is the probability density function of this Normal and the inner blue curve is the Taylor series expansion of the upper-half of the 3-dimensional ellipsoid as its approximation approaches infinity.